Status: Closed
Labels: enhancement (New feature or request)
Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
This is mostly related to ggml, but I was advised to report the issue here.
Basically, this would require implementing quantization shaders (f32 → quantized types) for Vulkan (that's the easy part), and supporting them in the C++ code.
Motivation
With stable-diffusion.cpp compiled with the Vulkan backend, attempting to load a LoRA on a quantized model (any non-float type) makes the program print `Missing CPY op for types: f32 q8_0` (for example) and crash at this line.
Having more ops implemented is a good thing, especially if it fixes a crash downstream.
Possible Implementation
I'm guessing something like this for the shaders (q8_0):
```glsl
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// #include "quant_head.comp" // does not exist yet; declaring its would-be
// contents inline for now:
layout (push_constant) uniform parameter { uint nel; } p;

struct block_q8_0 {
    float16_t d;
    int8_t qs[32];
};

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
layout (binding = 0) readonly buffer A {float data_a[];};
layout (binding = 1) writeonly buffer D {block_q8_0 data_b[];};

void main() {
    // One invocation quantizes one block of 32 values. (Splitting a block
    // across two invocations, as the dequant shaders do, would make two
    // threads race on data_b[ib].d, since the scale needs the whole block.)
    const uint ib = gl_GlobalInvocationID.x;
    if (ib >= p.nel / 32) {
        return;
    }
    const uint b_idx = 32 * ib;

    float absmax = 0.0;
    [[unroll]] for (uint j = 0; j < 32; ++j) {
        absmax = max(absmax, abs(data_a[b_idx + j]));
    }

    const float d  = absmax / 127.0;
    const float id = d != 0.0 ? 1.0 / d : 0.0;
    data_b[ib].d = float16_t(d);
    [[unroll]] for (uint j = 0; j < 32; ++j) {
        data_b[ib].qs[j] = int8_t(round(clamp(data_a[b_idx + j] * id, -127.0, 127.0)));
    }
}
```
I don't know how to proceed further in the implementation.
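For reference, here is a minimal CPU sketch of the same per-block q8_0 semantics the shader is aiming for (one scale `d = absmax / 127` plus 32 signed 8-bit quants per block). The `block_q8_0_ref` struct and function name are hypothetical, and `d` is kept as a plain `float` rather than fp16 for simplicity; this is only to pin down the math, not ggml's actual implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int QK8_0 = 32; // values per q8_0 block

struct block_q8_0_ref {
    float  d;          // scale (fp16 in the real format; float here for clarity)
    int8_t qs[QK8_0];  // quantized values
};

// Hypothetical reference quantizer: x.size() must be a multiple of QK8_0.
std::vector<block_q8_0_ref> quantize_q8_0_ref(const std::vector<float>& x) {
    const size_t nb = x.size() / QK8_0;
    std::vector<block_q8_0_ref> out(nb);
    for (size_t ib = 0; ib < nb; ++ib) {
        // scale is the block's largest magnitude mapped to 127
        float absmax = 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            absmax = std::max(absmax, std::fabs(x[ib * QK8_0 + j]));
        }
        const float d  = absmax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        out[ib].d = d;
        for (int j = 0; j < QK8_0; ++j) {
            out[ib].qs[j] = (int8_t)std::lround(
                std::clamp(x[ib * QK8_0 + j] * id, -127.0f, 127.0f));
        }
    }
    return out;
}
```

Comparing a GPU shader's output buffer against this on random data would be one way to validate the Vulkan implementation.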