Kv_unified, n_seq_max #17421
RedDragonGecko asked this question in Q&A (unanswered)
Okay, preamble.
I've been running GLM 4.6 (UD-Q5_K_XL) since its release and hadn't updated to a new build of llama.cpp since then.
I recently had a long conversation going. I had previously capped my context at 32k and hit that limit, so I expanded my context to 64k. I also removed "-ctk q8_0 -ctv q8_0" to see how that would affect performance.
With this loaded I got a couple of messages further before it started repeating a single character forever, e.g. "?????????????". Changing Top K would change which character it repeated ("//////////////", for instance), but nothing I tried would stop it.
My first thought was to try a different quant of the model. I loaded up the same messages and got the exact same behavior. My second thought was to upgrade to the latest build of llama.cpp.
Now it cannot load; it runs out of VRAM with the same settings. I looked at what was happening and discovered:
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 64000
llama_context: n_ctx_seq = 64000
llama_context: n_batch = 4096
llama_context: n_ubatch = 4096
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (64000) < n_ctx_train (202752) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 2.31 MiB
llama_kv_cache: CUDA0 KV buffer size = 9000.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 4250.00 MiB
llama_kv_cache: CUDA2 KV buffer size = 4250.00 MiB
llama_kv_cache: CUDA3 KV buffer size = 4250.00 MiB
llama_kv_cache: CUDA4 KV buffer size = 1250.00 MiB
llama_kv_cache: size = 23000.00 MiB ( 64000 cells, 92 layers, 4/1 seqs), K (f16): 11500.00 MiB, V (f16): 11500.00 MiB
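For reference, here is a rough back-of-the-envelope sketch of the KV cache sizing implied by that log. The kv_cache_mib helper and the 1024-elements-per-layer figure are my own back-calculation from the reported 23000 MiB, not anything pulled from llama.cpp or the model config, and q8_0 is approximated at about 1.0625 bytes per element (34-byte block per 32 elements):

```python
# Rough KV cache sizing sketch based on the numbers in the log above.
# Assumptions (not verified against the GLM 4.6 config): 92 layers and
# ~1024 K elements per layer per cell, back-calculated from the reported
# 23000 MiB at f16; q8_0 taken as ~1.0625 bytes per element.

MiB = 1024 * 1024

def kv_cache_mib(n_cells, n_layers=92, elems_per_layer=1024, bytes_per_elem=2.0):
    """Approximate total K+V cache size in MiB."""
    one_tensor = n_cells * n_layers * elems_per_layer * bytes_per_elem  # K (or V) alone
    return 2 * one_tensor / MiB                                         # K + V together

print(kv_cache_mib(64000, bytes_per_elem=2.0))     # ~23000 MiB, matches the log above
print(kv_cache_mib(32000, bytes_per_elem=1.0625))  # ~6109 MiB, my old 32k + q8_0 setup
```

So doubling the context and dropping the q8_0 cache types already roughly quadruples the KV cache compared to my old 32k + q8_0 setup, independent of which build is running.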
Now, I'm not sure what n_seq_max and kv_unified do, and I haven't found a way to turn them back off to test whether they are causing the increased VRAM usage. Looking at the load output from my older version, they are set to 1 and false respectively.