Misc. bug: KV Cache seems to be initialized twice for the draft model? #12436

@Mushoz

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 4906 (60c9029)
built with cc (GCC) 14.2.1 20250207 for x86_64-pc-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server \
      --port 9002 \
      --metrics \
      --slots \
      -m /models/Qwen_QwQ-32B-IQ4_XS.gguf \
      -ngl 999 \
      --ctx-size 32768 \
      --no-context-shift \
      -fa \
      -ctv q8_0 \
      -ctk q8_0 \
      -md /models/Qwen2.5-0.5B-Instruct-IQ4_XS.gguf \
      -ngld 99 \
      --draft-p-min 0.5 \
      --draft-min 0 \
      --draft-max 15

Problem description & steps to reproduce

When I run llama-server, the output shows the KV cache for the main model being initialized first, as expected:

llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 64, can_shift = 1
init:      ROCm0 KV buffer size =  4352.00 MiB
llama_context: KV self size  = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
llama_context:      ROCm0 compute buffer size =   325.08 MiB
llama_context:  ROCm_Host compute buffer size =    74.01 MiB
llama_context: graph nodes  = 1991
llama_context: graph splits = 2
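
As a sanity check, the main-model numbers line up with the usual KV-size arithmetic (my own sketch, not llama.cpp code; it assumes QwQ-32B has 64 layers with 8 KV heads of head dim 128, and that q8_0 costs 8.5 bits per element):

    # Sanity-check the main model's KV buffer size from the log.
    n_ctx      = 32768
    n_layer    = 64
    n_embd_kv  = 8 * 128    # KV heads * head dim (my assumption for QwQ-32B)
    bytes_q8_0 = 34 / 32    # q8_0: 32 elements per 34-byte block = 8.5 bits/elem

    k_mib = n_ctx * n_layer * n_embd_kv * bytes_q8_0 / 2**20
    print(f"K: {k_mib:.2f} MiB, K+V: {2 * k_mib:.2f} MiB")
    # -> K: 2176.00 MiB, K+V: 4352.00 MiB, matching the log above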

However, the KV cache for the draft model is then initialized, and the initialization appears to happen twice:

llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context:  ROCm_Host  output buffer size =     0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 24, can_shift = 1
init:      ROCm0 KV buffer size =   204.00 MiB
llama_context: KV self size  =  204.00 MiB, K (q8_0):  102.00 MiB, V (q8_0):  102.00 MiB
llama_context:      ROCm0 compute buffer size =   300.26 MiB
llama_context:  ROCm_Host compute buffer size =    65.76 MiB
llama_context: graph nodes  = 751
llama_context: graph splits = 50
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 32768
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context:  ROCm_Host  output buffer size =     0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
init:      ROCm0 KV buffer size =   384.00 MiB
llama_context: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_context:      ROCm0 compute buffer size =   300.25 MiB
llama_context:  ROCm_Host compute buffer size =    65.76 MiB
llama_context: graph nodes  = 751
llama_context: graph splits = 2

Interestingly, the first draft-model initialization uses q8_0 quantization, the same as what I set for the main model, while the second one uses an FP16 KV cache. My questions are:

  1. Is the output correct, and is the KV cache for the draft model really initialized twice? Or is this a display error?
  2. Is KV quantization being applied to the draft model or not? The log output contradicts itself (see the size check below).
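
For what it's worth, both draft-model figures are internally consistent with their stated cache types; they just disagree with each other about which type is in use. A quick size check (again my own sketch, assuming Qwen2.5-0.5B has 24 layers with 2 KV heads of head dim 64) reproduces both log lines:

    # Reproduce the two draft-model KV sizes from the log.
    n_ctx     = 32768
    n_layer   = 24
    n_embd_kv = 2 * 64   # KV heads * head dim (my assumption for Qwen2.5-0.5B)

    for name, bytes_per_elem in [("q8_0", 34 / 32), ("f16", 2.0)]:
        kv_mib = 2 * n_ctx * n_layer * n_embd_kv * bytes_per_elem / 2**20
        print(f"{name}: K+V = {kv_mib:.2f} MiB")
    # -> q8_0: K+V = 204.00 MiB (first init), f16: K+V = 384.00 MiB (second init)

So the two initializations really do use different cache types; the open question is whether both allocations stay resident.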

I am using QwQ-32B as the main model and Qwen2.5-0.5B-Instruct (with QwQ's tokenizer, so it is accepted as a draft model). I am seeing a good ~1.5-2x speedup, so speculative decoding itself seems to be working fine. But if the draft model's KV cache really is initialized twice, VRAM may be wasted for no reason.
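
As a rough way to check on my side, the difference between single and double allocation should be visible in the VRAM usage reported by rocm-smi --showmeminfo vram (the totals below are taken straight from the log above, so this is only an estimate):

    # Expected KV-cache VRAM if the draft cache is allocated once vs. if the
    # first (q8_0) draft context is never freed. Figures from the log above.
    main_kv   = 4352.00  # MiB, q8_0, main-model init
    draft_q8  = 204.00   # MiB, q8_0, first draft init
    draft_f16 = 384.00   # MiB, f16, second draft init

    print(f"KV total, single draft allocation: {main_kv + draft_f16:.2f} MiB")
    print(f"KV total, double draft allocation: {main_kv + draft_q8 + draft_f16:.2f} MiB")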

First Bad Commit

No response
