opencl: add kernel to handle mat mul in attention to improve encoding speed #17181
Conversation
On X Elite (X1-85), master vs. this PR: [benchmark tables not captured in the page extraction]
… speed (ggml-org#17181)
* Add mul_mm_f16_f32_kq_kqv kernel
* Add ggml_cl_mul_mat_kq_kqv_adreno func
* fix whitespace
* remove unused variable
* remove redundant
* refactor and clean up
* remove trailing whitespace
@shaofeiqi @max-krasnyansky On 8Gen3, this PR decreases decoding performance.

Without this PR: [results not captured]
With this PR: [results not captured]
Test command: [not captured]
@lippman1125 I suppose you are referring to tg; this PR should not affect tg. On master, using 8Gen3, ggml_opencl: device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)':

Without this change (lines 6897 - 6902 manually commented out), build: 7d77f07 (7108): [results not captured]
With this change, build: 7d77f07 (7108): [results not captured]

tg numbers seem about the same for my setup.
@lhez Thanks for your reply. I verified it again and there is no problem. Good job!
This PR adds a new kernel that specifically handles the matrix multiplications in attention (KQ and KQV). This should improve encoding (prompt-processing) performance for most models.