Question About use_adreno_kernels Threshold for Q4 MatMul on Adreno 750 #17733
Replies: 1 comment 1 reply
@forforever73 Apologies for the delay.
@lhez Sorry to take up your time. I'm running a new model on an Adreno 750 GPU and noticed that, for Q4 weights, using the optimized kernel CL_mul_mat_Ab_Bi_8x4 seems to require that use_adreno_kernels() returns true. However, my model has several matmul shapes like:
A: [256, 1280, 1, 1]
B: [256, 512, 1, 1]
→ Output: [1280, 512, 1, 1]
For these shapes use_adreno_kernels() returns false, so the matmul falls back to kernel_mul_mat_q4_0_f32_1d_8x_flat. This fallback kernel is about 10× slower on Adreno 750. I experimented by lowering the internal threshold:
int64_t threshold_ne0 = 256;
After lowering the threshold, the Adreno kernels are used, performance improves dramatically, and the model's PPL shows no meaningful change.
So what was the original reasoning behind the use_adreno_kernels() threshold? If I reduce the threshold to 256, is there any potential risk I should be aware of?
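To make the shape check concrete, here is a minimal sketch of how such a gate might behave for the A tensor above. It is not the actual backend code: `Shape`, `passes_threshold`, and the 512 default are assumptions for illustration, as is the idea that the check compares ne[0] and ne[1] against the thresholds and requires ne[2] == ne[3] == 1.

```cpp
#include <cstdint>

// Hypothetical stand-in for a tensor's dimensions ne[0]..ne[3].
struct Shape {
    std::int64_t ne[4];
};

// Assumed form of the gate: both leading dimensions must reach their
// thresholds and the tensor must be 2D (ne[2] == ne[3] == 1).
constexpr bool passes_threshold(const Shape &s,
                                std::int64_t threshold_ne0,
                                std::int64_t threshold_ne1) {
    return s.ne[0] >= threshold_ne0 && s.ne[1] >= threshold_ne1 &&
           s.ne[2] == 1 && s.ne[3] == 1;
}

// The A tensor from the post: [256, 1280, 1, 1].
constexpr Shape kA = {{256, 1280, 1, 1}};

// Assumed original threshold; the value is not stated in the post.
constexpr std::int64_t kAssumedDefault = 512;
```

Under these assumptions, `passes_threshold(kA, kAssumedDefault, kAssumedDefault)` is false because ne[0] = 256 is below 512, which would explain the fallback. With `threshold_ne0` lowered to 256 the same shape passes.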