
Conversation

@slaren (Member) commented Feb 21, 2024

Opening this as a proof of concept of a possible solution. It should work, but it requires implementing a quant -> F32 ggml_cpy op in the backends.
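For illustration, a minimal sketch (not the PR's actual code) of the dequantize -> shift -> requantize round trip this approach implies for one layer, with illustrative names for the per-layer K cache tensor and its dimensions:

```c
// Sketch of the round trip for one layer's K cache.
// kv_k_l, n_embd_k, kv_size, ctx and gf are illustrative names, not the PR's.
struct ggml_tensor * k_quant = kv_k_l; // quantized K cache of this layer
struct ggml_tensor * k_f32   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd_k, kv_size);

// quant -> F32: this is the ggml_cpy variant the backends need to implement
struct ggml_tensor * k_deq = ggml_cpy(ctx, k_quant, k_f32);

// ... apply the RoPE shift to the F32 copy here (ggml_rope_custom_inplace in llama.cpp) ...

// F32 -> quant: write the shifted values back into the quantized cache
ggml_build_forward_expand(gf, ggml_cpy(ctx, k_deq, k_quant));
```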

@ggerganov (Member)

Yup, we should do that. Having Q -> F32 will be useful anyway. Though it's not very high-prio IMO

@ngxson (Collaborator) commented Feb 22, 2024

Thanks for looking into this. I understand that it's not a priority for the moment, so no problem.

I can confirm that this PR resolves the problem mentioned in my issue, but it throws another error in ggml_compute_forward_dup (which is expected for now, since we still need some changes in the ggml backend).

@vonjackustc commented Mar 18, 2024

Added cpy for fp16 to q8_0 and q8_0 to fp16:
3d92acf

Tested on M2 Pro (Metal backend).

I'm not familiar with CUDA, so please check.

@slaren (Member, Author) commented Mar 18, 2024

There are already dequantization kernels, it would be better to reuse these instead of duplicating the code.
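As a rough illustration of that suggestion only: the sketch below routes a quantized -> F32 copy through the dequantization launchers that ggml-cuda already uses for mat-mul. The `to_fp32_cuda_t` / `ggml_get_to_fp32_cuda` names mirror internal helpers from memory; treat their exact signatures as an assumption, not verified code.

```cpp
// Sketch only: reuse the existing dequantization kernels for a
// quantized -> F32 ggml_cpy instead of writing new copy kernels.
static void ggml_cuda_cpy_quant_to_f32(const ggml_tensor * src, ggml_tensor * dst, cudaStream_t stream) {
    GGML_ASSERT(ggml_is_quantized(src->type) && dst->type == GGML_TYPE_F32);
    GGML_ASSERT(ggml_are_same_shape(src, dst));
    GGML_ASSERT(ggml_is_contiguous(src) && ggml_is_contiguous(dst));

    // launch the existing dequantization kernel for this quant type
    const to_fp32_cuda_t to_fp32 = ggml_get_to_fp32_cuda(src->type);
    to_fp32(src->data, (float *) dst->data, ggml_nelements(src), stream);
}
```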

@ghost commented Sep 16, 2024

Is this hard to implement? It would be very nice to allow the K-shift for a quantized KV cache.

@ghost commented Sep 18, 2024

Okay, I tried to experiment with the code in this PR, but it allocates too much CUDA memory. Any hints? I suppose it happens because temporary tensors are allocated for every layer at once.

@ngxson (Collaborator) commented Sep 18, 2024

I think the better solution would be to have kernels for ggml_rope_custom_inplace that support quantized tensors, but that would be complicated to implement.

The underlying problem is that we use a quantized KV cache precisely because there is not enough memory to store the dequantized tensors. What we are currently doing here is dequantizing, applying RoPE, then quantizing back, which definitely uses more memory.
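For a rough sense of scale, a back-of-the-envelope comparison with hypothetical dimensions (not taken from this PR): one layer's K cache at a 32768-token context with 1024 K elements per token.

```
F32:  32768 * 1024 * 4 B        = 128 MiB
q8_0: 32768 * 1024 * (34/32) B  ≈  34 MiB   (32-element blocks: 32 int8 + fp16 scale)
q4_0: 32768 * 1024 * (18/32) B  ≈  18 MiB   (32-element blocks: 16 B of nibbles + fp16 scale)
```

So the temporary F32 copy is several times larger than the quantized cache it shadows, and materializing it for all layers at once multiplies that again by the layer count.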

@ghost commented Sep 18, 2024

But why does it need so much memory at once? Is it because each layer's K-shift is computed in parallel?

@slaren (Member, Author) commented Sep 18, 2024

It allocates a temporary tensor to hold an F32 copy of the K cache of one layer, which can be substantial with a large context size. However, there is no need to convert the entire K cache at once; it could be split into smaller parts to reduce the size of the temporary tensor.
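A minimal sketch of that chunked variant, with illustrative names (k_quant, n_embd_k, kv_size, ctx, gf), an arbitrary chunk size, and the RoPE call itself elided:

```c
// Sketch: convert one layer's K cache a slice of rows at a time, so the
// temporary F32 tensor only has to cover one slice.
const int64_t n_chunk = 1024; // cache cells converted per step (arbitrary)
struct ggml_tensor * k_f32 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd_k, n_chunk);

for (int64_t i0 = 0; i0 < kv_size; i0 += n_chunk) {
    const int64_t n = i0 + n_chunk <= kv_size ? n_chunk : kv_size - i0;

    // n rows of the quantized cache starting at row i0, and a matching F32 view
    struct ggml_tensor * k_q_view = ggml_view_2d(ctx, k_quant, n_embd_k, n,
            k_quant->nb[1], i0*k_quant->nb[1]);
    struct ggml_tensor * k_f_view = ggml_view_2d(ctx, k_f32, n_embd_k, n,
            k_f32->nb[1], 0);

    struct ggml_tensor * cur = ggml_cpy(ctx, k_q_view, k_f_view); // quant -> F32
    // ... RoPE shift on `cur` for these rows here ...
    ggml_build_forward_expand(gf, ggml_cpy(ctx, cur, k_q_view));  // F32 -> quant
}
```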

@ghost commented Sep 18, 2024

The amount of memory in the OOM message looked like it would fit the KV cache for all layers in F32. I tried to "reuse" the temporary tensor from previous layers in the loop, which seemingly fixed the allocation issue, but then I hit another problem: a missing implementation in ggml_compute_forward_cpy, which is also a little strange (I copied the implementation for CUDA, but I didn't expect the CPU backend to kick in).

[screenshot of the error]
Could be that I messed something up, though.

@slaren (Member, Author) commented Sep 18, 2024

> The amount of memory in the OOM message looked like it would fit the KV cache for all layers in F32.

That's unexpected, and you definitely should not need to reuse the tensor manually; ggml-alloc should take care of that.

Did you update the ggml_backend_cuda_supports_op function so that it reports that it supports this type of copy? Otherwise, the ggml_cpy operation will be performed on the CPU. You can see which backend is being used for each operation by setting the GGML_SCHED_DEBUG environment variable.
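For reference, a hedged sketch of the kind of check the GGML_OP_CPY case of ggml_backend_cuda_supports_op has to report; the real switch in ggml-cuda.cu is only approximated here:

```cpp
// Sketch only: if this combination is not reported as supported, the
// scheduler falls back to the CPU backend for the copy.
static bool cuda_supports_cpy(const ggml_tensor * op) {
    const ggml_type src0_type = op->src[0]->type;
    const ggml_type src1_type = op->src[1]->type;

    // existing combinations (F32/F16 copies, F32 -> quantized, ...)
    if (src0_type == GGML_TYPE_F32 && src1_type == GGML_TYPE_F32) {
        return true;
    }

    // new combination needed by this PR: quantized -> F32
    if (ggml_is_quantized(src0_type) && src1_type == GGML_TYPE_F32) {
        return true;
    }

    return false;
}
```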

@ghost commented Sep 18, 2024

Thanks, I indeed had not updated ggml_backend_cuda_supports_op. Now it's progressing a bit further: I get a "ROPE failed" error with reason "out of memory".

@ghost commented Sep 18, 2024

And... I managed to make it work, with a hack.

[screenshot of the hack]
I hardcoded the layer that corresponds to the tensor split among my two GPUs. Without this hack, something strange happens. I can try to upload the scheduler logs later.

@slaren (Member, Author) commented Sep 20, 2024

The scheduler may have trouble figuring out the best place to put the tensors since there are no weights in this graph. ggml_backend_sched_set_tensor_backend could be used on each layer to force the operations to run on the backend where the KV cache of that layer is allocated.
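A minimal sketch of how that could look, with hypothetical names for the per-layer lookup and the shift nodes; ggml_backend_sched_set_tensor_backend is the real API call:

```c
// Sketch: pin each layer's shift operations to the backend that holds that
// layer's K cache, so the scheduler does not have to guess.
// backend_for_layer() and k_shift_nodes[] stand in for the caller's own
// bookkeeping of how layers map to devices.
for (int il = 0; il < n_layer; ++il) {
    ggml_backend_t backend = backend_for_layer(il); // hypothetical lookup
    ggml_backend_sched_set_tensor_backend(sched, k_shift_nodes[il], backend);
}
```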


Successfully merging this pull request may close this issue: llama_kv_cache_seq_shift does not work with cache type q4_0
