llama : fix K-shift with quantized K (wip) #5653
Conversation
Yup, we should do that. Having Q -> F32 will be useful anyway. Though it's not very high-prio IMO.
Thanks for having looked into this. I understand that it's not our priority for the moment, so no problem. I can confirm that this PR resolves the problem mentioned in my issue, but it throws another error on
Added cpy fp16 -> q8_0 and q8_0 -> fp16. Tested on M2 Pro (Metal backend). I'm not familiar with CUDA, so please check.
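For reference, a minimal CPU sketch of the per-block math an fp16/F32 -> q8_0 copy kernel has to perform, assuming ggml's block_q8_0 layout (one F16 scale followed by 32 signed 8-bit values); the Metal/CUDA kernels do the same arithmetic, typically one block per thread. The block struct is redeclared here only for illustration; ggml_fp32_to_fp16() is ggml's public conversion helper.

```c
#include <math.h>
#include <stdint.h>
#include "ggml.h"   // for ggml_fp16_t and ggml_fp32_to_fp16()

#define QK8_0 32                       // values per Q8_0 block (ggml convention)

// Assumed to match ggml's block_q8_0: one F16 scale followed by 32 int8 quants.
typedef struct {
    ggml_fp16_t d;                     // delta (scale)
    int8_t      qs[QK8_0];             // quantized values
} block_q8_0;

// Quantize one block of 32 floats (e.g. already widened from fp16) to Q8_0.
static void quantize_block_q8_0(const float * x, block_q8_0 * y) {
    float amax = 0.0f;                 // absolute maximum of the block
    for (int i = 0; i < QK8_0; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) {
            amax = v;
        }
    }

    const float d  = amax / 127.0f;    // scale so that amax maps to +/-127
    const float id = d != 0.0f ? 1.0f/d : 0.0f;

    y->d = ggml_fp32_to_fp16(d);
    for (int i = 0; i < QK8_0; i++) {
        y->qs[i] = (int8_t) roundf(x[i] * id);
    }
}
```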
There are already dequantization kernels; it would be better to reuse these instead of duplicating the code.
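As a sketch of what that reuse could look like on the CPU side: ggml already exposes each type's row dequantizer through its type traits, so a quant -> F32 copy can dispatch to to_float instead of re-implementing the per-type math. The traits accessor below assumes the ggml_internal_get_type_traits() API from ggml trees of that era (newer trees rename it to ggml_get_type_traits() and return a pointer).

```c
#include "ggml.h"

// Hedged sketch: dequantize a contiguous quantized tensor into an F32 buffer by
// reusing the type's existing to_float routine, row by row.
static void dequantize_tensor_to_f32(const struct ggml_tensor * src, float * dst) {
    const ggml_type_traits_t traits = ggml_internal_get_type_traits(src->type);

    GGML_ASSERT(traits.to_float != NULL);   // the type must have a dequantizer
    GGML_ASSERT(ggml_is_contiguous(src));

    const int64_t nrows    = ggml_nrows(src);
    const int64_t ne0      = src->ne[0];                    // elements per row
    const size_t  row_size = ggml_row_size(src->type, ne0); // bytes per row

    for (int64_t r = 0; r < nrows; ++r) {
        const char * src_row = (const char *) src->data + r*row_size;
        traits.to_float(src_row, dst + r*ne0, ne0);
    }
}
```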
Is this hard to implement? It would be very nice to allow K-shift with a quantized KV cache.
Okay, I tried to experiment with the code in this PR, but it allocates too much CUDA memory. Any hint? I suppose it happens because temporary tensors for each layer are allocated at once.
I think the better solution would be to have kernels for RoPE that work directly on quantized data. The current problem is that we use a quantized KV cache because there is not enough memory to store dequantized tensors; what we are currently doing here is dequantize, RoPE, then quantize back, which definitely uses more memory.
But why does it need so much memory at once? Is it because each layer's K-shift is computed in parallel? |
It allocates a temporary tensor to hold a copy of the K cache of one layer in F32, which can be substantial with a large context size. However, there is no need to convert the entire K cache at once; it could be split into smaller parts to reduce the size of the temporary tensor.
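A rough graph-building sketch of that chunking idea, so the F32 temporary only ever covers n_chunk cells of one layer's K cache at a time. The tensor names and shapes are assumptions (the layer's K cache viewed as [n_embd_k, n_ctx], shifts holding one int32 position delta per cell, the full head dimension n_embd_head being rotated), and the plain ggml_rope() call stands in for the extended RoPE call llama.cpp actually uses; the two ggml_cpy() calls are exactly the quant <-> F32 copies discussed in this PR.

```c
#include "ggml.h"

// Hedged sketch: apply the K-shift (RoPE by a per-cell delta) to a quantized
// per-layer K cache in chunks, bounding the size of the F32 temporary.
//   k_l    : the layer's K cache, assumed quantized, viewed as [n_embd_k, n_ctx]
//   shifts : I32 tensor holding n_ctx position deltas
static void build_k_shift_chunked(
        struct ggml_context * ctx, struct ggml_cgraph * gf,
        struct ggml_tensor  * k_l, struct ggml_tensor * shifts,
        int64_t n_embd_head, int64_t n_chunk) {
    const int64_t n_embd_k = k_l->ne[0];
    const int64_t n_ctx    = k_l->ne[1];

    for (int64_t i0 = 0; i0 < n_ctx; i0 += n_chunk) {
        const int64_t n = n_chunk < n_ctx - i0 ? n_chunk : n_ctx - i0;

        // view of n cells of the quantized cache, starting at cell i0
        struct ggml_tensor * k_view = ggml_view_2d(ctx, k_l, n_embd_k, n,
                k_l->nb[1], i0*k_l->nb[1]);

        // dequantize into a small F32 temporary (needs a quant -> F32 ggml_cpy)
        struct ggml_tensor * tmp = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd_k, n);
        tmp = ggml_cpy(ctx, k_view, tmp);

        // position deltas for this chunk
        struct ggml_tensor * pos = ggml_view_1d(ctx, shifts, n, i0*shifts->nb[0]);

        // shift: view as [n_embd_head, n_head_kv, n] and RoPE by the deltas
        // (assumes all head dims are rotated; the real code uses the extended
        //  RoPE variant with the model's parameters)
        tmp = ggml_rope(ctx,
                ggml_reshape_3d(ctx, tmp, n_embd_head, n_embd_k/n_embd_head, n),
                pos, n_embd_head, 0);

        // quantize back into the cache (F32 -> quant ggml_cpy)
        ggml_build_forward_expand(gf, ggml_cpy(ctx, tmp, k_view));
    }
}
```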
The amount of memory in the OOM message looked like it would fit the KV cache for all layers in F32. I tried to "reuse" the tmp tensor from previous layers in the loop, which seemingly fixed the allocation issue, but then I hit another issue: a missing implementation of
That's unexpected; there should definitely be no need to reuse the tensor manually, ggml-alloc should take care of that. Did you update the
Thanks, I had indeed not updated
The scheduler may have trouble figuring out the best place to put the tensors, since there are no weights.


Opening this as a proof of concept of a possible solution. It should work, but it requires implementing a quant -> F32 ggml_cpy op in the backends.
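For the Q8_0 case, the per-block math such a quant -> F32 ggml_cpy kernel has to perform is just the inverse of the quantization step: scale each int8 value by the block's F16 delta. A minimal CPU reference, again assuming ggml's block_q8_0 layout (the backends already contain this logic in their dequantization kernels, which is why reusing them is attractive).

```c
#include <stdint.h>
#include "ggml.h"   // for ggml_fp16_t and ggml_fp16_to_fp32()

#define QK8_0 32                       // values per Q8_0 block (ggml convention)

// Assumed to match ggml's block_q8_0: one F16 scale followed by 32 int8 quants.
typedef struct {
    ggml_fp16_t d;                     // delta (scale)
    int8_t      qs[QK8_0];             // quantized values
} block_q8_0;

// Dequantize one Q8_0 block into 32 floats.
static void dequantize_block_q8_0(const block_q8_0 * x, float * y) {
    const float d = ggml_fp16_to_fp32(x->d);
    for (int i = 0; i < QK8_0; i++) {
        y[i] = x->qs[i] * d;
    }
}
```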