Status: Closed
Labels: enhancement (New feature or request)
Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
This is mostly related to ggml, but I was advised to report the issue here.
Basically, this would require implementing quantization shaders (f32 → quantized types) for Vulkan (that's the easy part), and supporting them in the C++ code.
Motivation
With stable-diffusion.cpp compiled with the Vulkan backend, attempting to load a LoRA on a quantized model (any non-float type) makes the program print `Missing CPY op for types: f32 q8_0` (for example) and crash at this line.
Having more ops implemented is a good thing, especially if it fixes a crash downstream.
Possible Implementation
I'm guessing something like this for the shaders (q8_0):
```glsl
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// #include "quant_head.comp" // does not exist yet; declaring its would-be
// contents inline for now:
layout (push_constant) uniform parameter { uint nel; } p;

struct block_q8_0 {
    float16_t d;
    int8_t qs[32];
};

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout (binding = 0) readonly buffer A {float data_a[];};
layout (binding = 1) writeonly buffer D {block_q8_0 data_b[];};

void main() {
    // One invocation quantizes one block of 32 values. (Splitting a block
    // across two invocations, as the dequant shaders do, would make two
    // threads race on data_b[ib].d, since the scale needs the whole block.)
    const uint ib = gl_GlobalInvocationID.x;
    if (ib >= p.nel / 32) {
        return;
    }
    const uint b_idx = 32 * ib;

    float absmax = 0.0;
    [[unroll]] for (uint j = 0; j < 32; ++j) {
        absmax = max(absmax, abs(data_a[b_idx + j]));
    }

    const float d  = absmax / 127.0;
    const float id = d != 0.0 ? 1.0 / d : 0.0;
    data_b[ib].d = float16_t(d);
    [[unroll]] for (uint j = 0; j < 32; ++j) {
        data_b[ib].qs[j] = int8_t(round(clamp(data_a[b_idx + j] * id, -127.0, 127.0)));
    }
}
```
I don't know how to proceed further in the implementation.
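For reference, here is a minimal CPU sketch of the same per-block q8_0 semantics the shader is aiming for (one scale `d = absmax / 127` plus 32 signed 8-bit quants per block). The `block_q8_0_ref` struct and function name are hypothetical, and `d` is kept as a plain `float` rather than fp16 for simplicity; this is only to pin down the math, not ggml's actual implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int QK8_0 = 32; // values per q8_0 block

struct block_q8_0_ref {
    float  d;          // scale (fp16 in the real format; float here for clarity)
    int8_t qs[QK8_0];  // quantized values
};

// Hypothetical reference quantizer: x.size() must be a multiple of QK8_0.
std::vector<block_q8_0_ref> quantize_q8_0_ref(const std::vector<float>& x) {
    const size_t nb = x.size() / QK8_0;
    std::vector<block_q8_0_ref> out(nb);
    for (size_t ib = 0; ib < nb; ++ib) {
        // scale is the block's largest magnitude mapped to 127
        float absmax = 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            absmax = std::max(absmax, std::fabs(x[ib * QK8_0 + j]));
        }
        const float d  = absmax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        out[ib].d = d;
        for (int j = 0; j < QK8_0; ++j) {
            out[ib].qs[j] = (int8_t)std::lround(
                std::clamp(x[ib * QK8_0 + j] * id, -127.0f, 127.0f));
        }
    }
    return out;
}
```

Comparing a GPU shader's output buffer against this on random data would be one way to validate the Vulkan implementation.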