ggerganov
Member

From the llama-batched-bench numbers in #16490 (comment), I realized that the wrong Metal kernels were being used for SSM models. This PR also fixes some mv (matrix-vector) kernels for the case where src0 is permuted.

make -j && ./bin/llama-batched-bench -hf unsloth/granite-4.0-h-micro-GGUF:Q4_0 -c 4096 -b 2048 -ub 512 -npp 512 -ntg 128 -npl 1,2,4 -ngl 99 -fa on

Before

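| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 128 | 1 | 640 | 0.351 | 1458.49 | 1.358 | 94.25 | 1.709 | 374.45 |
| 512 | 128 | 2 | 1280 | 0.670 | 1527.90 | 3.064 | 83.54 | 3.735 | 342.75 |
| 512 | 128 | 4 | 2560 | 1.347 | 1520.46 | 4.693 | 109.10 | 6.040 | 423.86 |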

After

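| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 128 | 1 | 640 | 0.337 | 1520.61 | 1.359 | 94.16 | 1.696 | 377.33 |
| 512 | 128 | 2 | 1280 | 0.682 | 1501.49 | 1.976 | 129.57 | 2.658 | 481.61 |
| 512 | 128 | 4 | 2560 | 1.347 | 1520.02 | 2.895 | 176.89 | 4.242 | 603.51 |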
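
To make the before/after comparison easier to read, here is a minimal sketch that computes the text-generation speedup per batch size from the S_TG t/s column of the two tables. The variable and function names are illustrative only and are not part of llama.cpp; the numbers are copied from the tables above.

```python
# S_TG (tokens/s) copied from the Before and After tables, keyed by batch size B.
s_tg_before = {1: 94.25, 2: 83.54, 4: 109.10}
s_tg_after = {1: 94.16, 2: 129.57, 4: 176.89}


def tg_speedup(before, after):
    """Return the after/before throughput ratio for each batch size (hypothetical helper)."""
    return {b: after[b] / before[b] for b in before}


speedups = tg_speedup(s_tg_before, s_tg_after)
for b, ratio in speedups.items():
    print(f"B={b}: {ratio:.2f}x")
```

By this measure, single-sequence generation (B=1) is essentially unchanged, while B=2 and B=4 improve by roughly 1.55x and 1.62x. Prompt-processing throughput (S_PP) stays within a few percent in both runs.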

@ggerganov ggerganov requested a review from CISC as a code owner October 10, 2025 07:28
@github-actions github-actions bot added the ggml and Apple Metal labels Oct 10, 2025
Collaborator

@gabe-l-hart gabe-l-hart left a comment

I've tested this with llama-batched-bench and llama-parallel and confirmed that parallel requests are managed correctly and that I see the expected speedup. Thank you for this fix!

@ggerganov ggerganov merged commit a3cb047 into master Oct 11, 2025
69 checks passed
@ggerganov ggerganov deleted the gg/metal-mul-mat-fixes branch October 11, 2025 13:54