llama : fix K-shift with quantized K (wip) #5653
Conversation
Yup, we should do that. Having Q -> F32 will be useful anyway. Though it's not very high-prio IMO.
Thanks for having looked into this. I understand that it's not our priority for the moment, so no problem. I can confirm that this PR resolves the problem mentioned in my issue, but it throws another error on
Added cpy fp16 -> q8_0 and q8_0 -> fp16. Tested on M2 Pro (Metal backend). I'm not familiar with CUDA, so please check.
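For reference, a minimal CPU sketch of the per-block math an fp16/F32 -> q8_0 copy kernel has to perform, assuming ggml's block_q8_0 layout (one F16 scale followed by 32 signed 8-bit values); the Metal/CUDA kernels do the same arithmetic, typically one block per thread. The block struct is redeclared here only for illustration; ggml_fp32_to_fp16() is ggml's public conversion helper.

```c
#include <math.h>
#include <stdint.h>
#include "ggml.h"   // for ggml_fp16_t and ggml_fp32_to_fp16()

#define QK8_0 32                       // values per Q8_0 block (ggml convention)

// Assumed to match ggml's block_q8_0: one F16 scale followed by 32 int8 quants.
typedef struct {
    ggml_fp16_t d;                     // delta (scale)
    int8_t      qs[QK8_0];             // quantized values
} block_q8_0;

// Quantize one block of 32 floats (e.g. already widened from fp16) to Q8_0.
static void quantize_block_q8_0(const float * x, block_q8_0 * y) {
    float amax = 0.0f;                 // absolute maximum of the block
    for (int i = 0; i < QK8_0; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) {
            amax = v;
        }
    }

    const float d  = amax / 127.0f;    // scale so that amax maps to +/-127
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    y->d = ggml_fp32_to_fp16(d);
    for (int i = 0; i < QK8_0; i++) {
        y->qs[i] = (int8_t) roundf(x[i] * id);
    }
}
```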
There are already dequantization kernels; it would be better to reuse these instead of duplicating the code.
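As a sketch of what that reuse could look like on the CPU side: ggml already exposes each type's row dequantizer through its type traits, so a quant -> F32 copy can dispatch to to_float instead of re-implementing the per-type math. The traits accessor below assumes the ggml_internal_get_type_traits() API from ggml trees of that era (newer trees rename it to ggml_get_type_traits() and return a pointer).

```c
#include "ggml.h"

// Hedged sketch: dequantize a contiguous quantized tensor into an F32 buffer by
// reusing the type's existing to_float routine, row by row.
static void dequantize_tensor_to_f32(const struct ggml_tensor * src, float * dst) {
    const ggml_type_traits_t traits = ggml_internal_get_type_traits(src->type);

    GGML_ASSERT(traits.to_float != NULL);   // the type must have a dequantizer
    GGML_ASSERT(ggml_is_contiguous(src));

    const int64_t nrows    = ggml_nrows(src);
    const int64_t ne0      = src->ne[0];                    // elements per row
    const size_t  row_size = ggml_row_size(src->type, ne0); // bytes per row

    for (int64_t r = 0; r < nrows; ++r) {
        const char * src_row = (const char *) src->data + r*row_size;
        traits.to_float(src_row, dst + r*ne0, ne0);
    }
}
```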
Is this hard to implement? It would be very nice to allow K-shift with a quantized KV cache.
Okay, I tried to experiment with the code in this PR, but it allocates too much CUDA memory. Any hint? I suppose it happens because temporary tensors for each layer are allocated at once.
I think the better solution would be to have kernels for RoPE that work directly on quantized data. The current problem is that we use a quantized KV cache because there is not enough memory to store dequantized tensors; what we are currently doing here is dequantize, RoPE, then quantize back, which definitely uses more memory.
But why does it need so much memory at once? Is it because each layer's K-shift is computed in parallel? |
It allocates a temporary tensor to hold a copy of the K cache of one layer in F32, which can be substantial with a large context size. However, there is no need to convert the entire K cache at once; it could be split into smaller parts to reduce the size of the temporary tensor.
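A rough graph-building sketch of that chunking idea, so the F32 temporary only ever covers n_chunk cells of one layer's K cache at a time. The tensor names and shapes are assumptions (the layer's K cache viewed as [n_embd_k, n_ctx], shifts holding one int32 position delta per cell, the full head dimension n_embd_head being rotated), and the plain ggml_rope() call stands in for the extended RoPE call llama.cpp actually uses; the two ggml_cpy() calls are exactly the quant <-> F32 copies discussed in this PR.

```c
#include "ggml.h"

// Hedged sketch: apply the K-shift (RoPE by a per-cell delta) to a quantized
// per-layer K cache in chunks, bounding the size of the F32 temporary.
//   k_l    : the layer's K cache, assumed quantized, viewed as [n_embd_k, n_ctx]
//   shifts : I32 tensor holding n_ctx position deltas
static void build_k_shift_chunked(
        struct ggml_context * ctx, struct ggml_cgraph * gf,
        struct ggml_tensor  * k_l, struct ggml_tensor * shifts,
        int64_t n_embd_head, int64_t n_chunk) {
    const int64_t n_embd_k = k_l->ne[0];
    const int64_t n_ctx    = k_l->ne[1];

    for (int64_t i0 = 0; i0 < n_ctx; i0 += n_chunk) {
        const int64_t n = n_chunk < n_ctx - i0 ? n_chunk : n_ctx - i0;

        // view of n cells of the quantized cache, starting at cell i0
        struct ggml_tensor * k_view = ggml_view_2d(ctx, k_l, n_embd_k, n,
                k_l->nb[1], i0*k_l->nb[1]);

        // dequantize into a small F32 temporary (needs a quant -> F32 ggml_cpy)
        struct ggml_tensor * tmp = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd_k, n);
        tmp = ggml_cpy(ctx, k_view, tmp);

        // position deltas for this chunk
        struct ggml_tensor * pos = ggml_view_1d(ctx, shifts, n, i0*shifts->nb[0]);

        // shift: view as [n_embd_head, n_head_kv, n] and RoPE by the deltas
        // (assumes all head dims are rotated; the real code uses the extended
        //  RoPE variant with the model's parameters)
        tmp = ggml_rope(ctx,
                ggml_reshape_3d(ctx, tmp, n_embd_head, n_embd_k/n_embd_head, n),
                pos, n_embd_head, 0);

        // quantize back into the cache (F32 -> quant ggml_cpy)
        ggml_build_forward_expand(gf, ggml_cpy(ctx, tmp, k_view));
    }
}
```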
The amount of memory in the OOM message looked like it would fit the KV cache for all layers in F32. I tried to "reuse" the tmp tensor from previous layers in the loop, which seemingly fixed the allocation issue, but then I hit another issue: a missing implementation of
That's unexpected; there should definitely be no need to reuse the tensor manually, ggml-alloc should take care of that. Did you update the
Thanks, I had indeed not updated
The scheduler may have trouble figuring out the best place to put the tensors, since there are no weights.


Opening this as a proof of concept of a possible solution. It should work, but it requires implementing a quant -> F32 ggml_cpy op in the backends.
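For the Q8_0 case, the per-block math such a quant -> F32 ggml_cpy kernel has to perform is just the inverse of the quantization step: scale each int8 value by the block's F16 delta. A minimal CPU reference, again assuming ggml's block_q8_0 layout (the backends already contain this logic in their dequantization kernels, which is why reusing them is attractive).

```c
#include <stdint.h>
#include "ggml.h"   // for ggml_fp16_t and ggml_fp16_to_fp32()

#define QK8_0 32                       // values per Q8_0 block (ggml convention)

// Assumed to match ggml's block_q8_0: one F16 scale followed by 32 int8 quants.
typedef struct {
    ggml_fp16_t d;                     // delta (scale)
    int8_t      qs[QK8_0];             // quantized values
} block_q8_0;

// Dequantize one Q8_0 block into 32 floats.
static void dequantize_block_q8_0(const block_q8_0 * x, float * y) {
    const float d = ggml_fp16_to_fp32(x->d);
    for (int i = 0; i < QK8_0; i++) {
        y[i] = x->qs[i] * d;
    }
}
```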