@0cc4m 0cc4m commented Oct 31, 2025

Add k-quant mul_mat_vec support, and enable MUL_MAT_ID integer dot vector path.

Tuning this is quite difficult. I've included an attempt, but I'm not done. I'll add performance numbers later.

Q3_K and Q6_K currently don't work well at all; I'm still trying to figure out why.
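
For context, the MMVQ approach quantizes the activation vector into 8-bit blocks as well, so the inner dot product can accumulate in integers via packed 4-way multiplies (what dp4a and VK_KHR_shader_integer_dot_product expose in hardware), applying the per-block scales only once per block. A minimal CPU sketch of the arithmetic, with illustrative names rather than the actual shader code:

```cpp
#include <cstdint>
#include <cstddef>

// dp4a-style step: dot product of four packed int8 pairs in one accumulate.
static int32_t dotprod4(const int8_t *a, const int8_t *b) {
    int32_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        acc += int32_t(a[i]) * int32_t(b[i]);
    }
    return acc;
}

// One output element of mat*vec: integer accumulation inside each 32-wide
// block, dequantized once per block with the two per-block scales.
// (Illustrative layout; the real shaders work on the k-quant block formats.)
float mmvq_row(const int8_t *w, const float *w_scales,  // quantized weights
               const int8_t *x, const float *x_scales,  // quantized activations
               size_t n) {                               // n % 32 == 0
    float sum = 0.0f;
    for (size_t b = 0; b < n / 32; ++b) {
        int32_t iacc = 0;
        for (int j = 0; j < 32; j += 4) {
            iacc += dotprod4(w + b * 32 + j, x + b * 32 + j);
        }
        sum += float(iacc) * w_scales[b] * x_scales[b];
    }
    return sum;
}
```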

@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 31, 2025
@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from d5192bf to d2f8f00 on November 1, 2025 11:31

0cc4m commented Nov 1, 2025

AMD Radeon Pro VII

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| ------ | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 63.49 ± 0.20 | 71.40 ± 0.24 | 83.84 ± 0.26 | +17.4% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 64.74 ± 0.12 | 67.75 ± 0.09 | 78.96 ± 0.20 | +16.5% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 48.80 ± 0.08 | 60.59 ± 0.14 | 59.91 ± 0.24 | -1.1% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 49.47 ± 0.44 | 58.06 ± 0.11 | 57.43 ± 0.04 | -1.1% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 65.92 ± 0.15 | 72.60 ± 0.17 | 76.77 ± 0.24 | +5.7% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 67.66 ± 0.18 | 69.41 ± 0.12 | 72.90 ± 0.19 | +5.0% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 19.10 ± 0.16 | 19.11 ± 0.09 | 24.50 ± 0.16 | +28.2% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 19.00 ± 0.05 | 18.24 ± 0.21 | 23.61 ± 0.22 | +29.4% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 80.04 ± 0.02 | 90.66 ± 0.17 | 87.32 ± 0.46 | -3.7% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 80.24 ± 0.10 | 86.01 ± 5.01 | 86.50 ± 0.53 | +0.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 67.68 ± 0.06 | 82.89 ± 0.22 | 85.36 ± 0.61 | +3.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 70.80 ± 0.03 | 75.71 ± 0.17 | 77.52 ± 0.12 | +2.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 107.99 ± 0.65 | 127.26 ± 0.27 | 128.89 ± 0.75 | +1.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 114.36 ± 0.11 | 125.49 ± 0.07 | 126.27 ± 0.37 | +0.6% |

AMD Radeon RX 6800 XT

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| ------ | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 93.30 ± 0.25 | 115.95 ± 3.40 | 122.98 ± 0.14 | +6.1% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 95.99 ± 0.11 | 109.65 ± 1.76 | 113.62 ± 0.02 | +3.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 75.50 ± 0.01 | 93.13 ± 0.05 | 90.81 ± 0.01 | -2.5% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 77.68 ± 0.00 | 88.41 ± 0.04 | 86.52 ± 0.01 | -2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 101.67 ± 0.04 | 148.71 ± 0.08 | 151.96 ± 0.03 | +2.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 106.92 ± 0.01 | 136.12 ± 0.39 | 137.91 ± 0.04 | +1.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 120.05 ± 0.05 | 145.28 ± 0.05 | 145.86 ± 0.02 | +0.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 124.10 ± 0.00 | 142.70 ± 0.06 | 143.23 ± 0.04 | +0.4% |

Intel A770

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| ------ | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 29.90 ± 0.32 | 44.53 ± 0.74 | +48.9% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 19.55 ± 0.01 | 26.37 ± 0.00 | +34.9% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 15.91 ± 0.01 | 15.92 ± 0.02 | +0.1% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 12.52 ± 0.03 | 12.56 ± 0.01 | +0.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 38.36 ± 0.04 | 47.72 ± 0.05 | +24.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 29.89 ± 0.01 | 34.91 ± 0.02 | +16.8% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 12.00 ± 0.01 | 14.29 ± 1.43 | +19.1% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 10.46 ± 0.02 | 11.90 ± 0.34 | +13.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 46.88 ± 2.27 | 49.79 ± 5.03 | +6.2% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 47.69 ± 0.42 | 51.01 ± 0.11 | +7.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 43.62 ± 0.04 | 41.81 ± 0.21 | -4.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 28.22 ± 0.05 | 28.22 ± 0.01 | +0.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 23.94 ± 0.03 | 39.25 ± 0.02 | +64.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 22.87 ± 0.05 | 36.10 ± 0.01 | +57.8% |

RTX 3090

| model | size | params | ngl | fa | test | t/s (CUDA) | t/s (before) | t/s (after) | diff |
| ------ | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 138.00 ± 0.66 | 114.32 ± 0.45 | 112.74 ± 0.36 | -1.4% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 136.82 ± 0.35 | 116.74 ± 0.35 | 114.95 ± 0.29 | -1.5% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 105.80 ± 0.29 | 98.13 ± 0.18 | 95.82 ± 0.58 | -2.4% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 105.10 ± 0.27 | 100.27 ± 0.37 | 96.59 ± 0.37 | -3.7% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 145.41 ± 0.43 | 123.22 ± 0.41 | 121.58 ± 2.54 | -1.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 144.52 ± 0.09 | 125.32 ± 0.18 | 126.04 ± 0.19 | +0.6% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 48.59 ± 0.03 | 38.82 ± 0.63 | 41.02 ± 0.18 | +5.7% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 48.44 ± 0.06 | 39.31 ± 0.14 | 41.31 ± 0.09 | +5.1% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 141.75 ± 0.46 | 143.90 ± 0.91 | 145.12 ± 1.67 | +0.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 141.72 ± 0.44 | 144.40 ± 0.24 | 145.24 ± 0.20 | +0.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 165.61 ± 1.53 | 151.74 ± 7.18 | 153.97 ± 0.99 | +1.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 162.49 ± 0.32 | 159.56 ± 1.25 | 159.13 ± 0.85 | -0.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 205.45 ± 1.12 | 153.52 ± 12.40 | 160.16 ± 17.99 | +4.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 210.33 ± 0.86 | 159.12 ± 0.81 | 172.44 ± 0.27 | +8.4% |

@0cc4m 0cc4m marked this pull request as ready for review November 1, 2025 11:47
@0cc4m 0cc4m requested a review from jeffbolznv November 1, 2025 11:48

@jeffbolznv jeffbolznv left a comment


I only did a quick read through. I'll do some perf testing soon.


0cc4m commented Nov 2, 2025

As usual, I appear to have caused an llvmpipe issue. I'll look into it.

@jeffbolznv

Some initial perf results:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       239.48 ± 11.34 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        201.44 ± 7.81 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        129.84 ± 4.07 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       872.67 ± 15.33 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       845.99 ± 13.20 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       391.09 ± 24.08 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       265.33 ± 14.59 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       251.59 ± 17.44 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       305.19 ± 28.81 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       301.64 ± 24.09 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       356.71 ± 17.34 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        273.06 ± 2.17 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       317.10 ± 15.70 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         91.93 ± 0.22 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         49.29 ± 0.22 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         91.03 ± 1.52 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         70.20 ± 0.40 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         48.53 ± 0.66 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       431.26 ± 28.74 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       397.86 ± 23.85 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        167.72 ± 3.56 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       153.41 ± 10.78 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        103.66 ± 3.49 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       173.04 ± 12.22 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |         37.22 ± 0.54 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        159.48 ± 1.35 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        108.88 ± 0.43 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        125.48 ± 0.54 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       238.12 ± 12.03 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        202.69 ± 5.07 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.12 ± 4.19 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       855.76 ± 15.46 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |      641.24 ± 260.16 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       396.68 ± 14.22 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        264.39 ± 8.21 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       250.60 ± 18.72 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       317.92 ± 10.59 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       325.54 ± 12.60 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       358.63 ± 16.21 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        277.27 ± 4.62 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        327.73 ± 7.12 |
| llama ?B Q4_K - Medium         |  12.42 GiB |    22.24 B | Vulkan     |  99 |  1 |           tg128 |         92.43 ± 2.13 |
| deci 70B Q4_K - Small          |  26.66 GiB |    49.87 B | Vulkan     |  99 |  1 |           tg128 |         50.05 ± 0.23 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\Mistral-22B-v0.2-Q4_K_M.gguf -m c:\models\nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         91.30 ± 0.94 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |         71.16 ± 0.26 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |         49.35 ± 0.18 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        461.59 ± 1.94 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        420.99 ± 1.95 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        167.92 ± 2.62 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        152.94 ± 8.52 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        106.06 ± 3.89 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       178.63 ± 16.11 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |         41.86 ± 1.68 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        160.77 ± 1.69 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        108.78 ± 1.08 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        125.95 ± 0.12 |

I reran some of the models with the biggest deltas. Most seem to be noise, except the improvement for gpt-oss MXFP4 is real:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       314.61 ± 23.74 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        323.84 ± 1.17 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        322.33 ± 2.26 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        319.46 ± 2.80 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        318.55 ± 3.96 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\llama-3.2-3b-instruct-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        332.90 ± 5.17 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.56 ± 0.96 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.42 ± 7.14 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.52 ± 6.45 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        334.98 ± 1.17 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       327.08 ± 19.41 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.18 ± 5.79 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        339.58 ± 3.17 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        338.76 ± 2.68 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        337.12 ± 5.83 |

build: 5d8bb900b (6910)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        132.41 ± 3.78 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.42 ± 0.73 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.74 ± 0.18 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.36 ± 0.23 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.26 ± 0.30 |

after:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |       331.53 ± 16.17 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        335.87 ± 1.67 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        334.85 ± 4.53 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        334.90 ± 2.64 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        333.53 ± 3.58 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\llama-3.2-3b-instruct-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.99 ± 2.56 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.84 ± 1.31 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        330.21 ± 5.07 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        327.78 ± 6.82 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        334.95 ± 1.13 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       321.82 ± 31.23 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        329.96 ± 4.85 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        335.48 ± 2.55 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.77 ± 6.32 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        334.00 ± 5.05 |

build: b153aac38 (6921)

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128,128,128,128,128 -p 0 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        131.75 ± 3.42 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.28 ± 0.68 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.52 ± 0.39 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.62 ± 0.41 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.60 ± 0.40 |

@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from b153aac to 1b78909 on November 7, 2025 19:51

0cc4m commented Nov 7, 2025

> Most seem to be noise, except the improvement for gpt-oss MXFP4 is real

The funny thing is that I didn't even enable the MMVQ path for Nvidia Turing+ on MXFP4. Not sure what is going on there.

I still have some tuning to do here; my Strix Halo device isn't liking this PR yet.

@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from 1b78909 to 937f992 on November 15, 2025 13:26

0cc4m commented Nov 15, 2025

The tuning seems okay now, even though I didn't change anything. @jeffbolznv Please take another look. Did you have any concerns with your benchmarks?

Here are updated results:

AMD Radeon 8060S

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| ------ | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 44.26 ± 0.95 | 56.34 ± 2.35 | 60.17 ± 1.18 | +6.8% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 45.61 ± 0.05 | 52.61 ± 1.51 | 58.76 ± 0.07 | +11.7% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 34.70 ± 0.14 | 46.78 ± 1.25 | 48.09 ± 1.42 | +2.8% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 35.60 ± 0.03 | 46.05 ± 0.23 | 47.44 ± 0.22 | +3.0% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 35.68 ± 0.05 | 43.27 ± 0.23 | 43.42 ± 0.32 | +0.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 37.39 ± 0.05 | 42.69 ± 0.12 | 43.55 ± 0.06 | +2.0% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 68.22 ± 0.22 | 91.84 ± 5.25 | 90.58 ± 1.56 | -1.4% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 68.39 ± 0.40 | 89.59 ± 0.81 | 89.12 ± 0.93 | -0.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 55.58 ± 0.44 | 91.88 ± 0.94 | 95.63 ± 0.64 | +4.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 59.90 ± 0.38 | 88.72 ± 0.91 | 93.37 ± 0.39 | +5.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 61.70 ± 0.17 | 69.55 ± 1.05 | 72.25 ± 0.74 | +3.9% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 64.48 ± 0.33 | 70.75 ± 0.44 | 73.16 ± 0.17 | +3.4% |

AMD RX 6800 XT

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| ------ | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 91.24 ± 0.24 | 117.51 ± 0.54 | 123.59 ± 0.21 | +5.2% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 96.87 ± 0.10 | 111.03 ± 0.15 | 114.65 ± 0.01 | +3.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 73.90 ± 0.01 | 93.52 ± 0.05 | 91.08 ± 0.02 | -2.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 78.24 ± 0.00 | 88.73 ± 0.02 | 86.83 ± 0.01 | -2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 97.29 ± 0.07 | 154.58 ± 1.14 | 162.02 ± 0.03 | +4.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 105.90 ± 0.00 | 139.83 ± 0.03 | 145.91 ± 0.02 | +4.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 116.54 ± 0.01 | 145.73 ± 0.76 | 147.00 ± 0.01 | +0.9% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 122.11 ± 0.01 | 143.62 ± 0.02 | 144.69 ± 0.05 | +0.7% |

AMD Radeon Pro VII

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| ------ | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 61.77 ± 0.21 | 72.03 ± 0.31 | 86.22 ± 0.39 | +19.7% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 64.29 ± 0.13 | 68.72 ± 0.16 | 81.18 ± 0.89 | +18.1% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 47.98 ± 0.08 | 61.70 ± 0.18 | 60.99 ± 0.59 | -1.2% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 49.86 ± 0.00 | 59.22 ± 0.08 | 58.43 ± 0.24 | -1.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 65.02 ± 0.17 | 73.87 ± 0.39 | 78.50 ± 0.44 | +6.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 68.17 ± 0.11 | 70.42 ± 0.25 | 75.18 ± 0.17 | +6.8% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 19.27 ± 0.16 | 19.19 ± 0.02 | 24.82 ± 0.09 | +29.3% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 19.41 ± 0.03 | 18.88 ± 0.05 | 24.08 ± 0.13 | +27.5% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 80.06 ± 0.09 | 83.66 ± 5.57 | 82.35 ± 2.57 | -1.6% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 80.35 ± 0.11 | 78.49 ± 1.53 | 82.49 ± 2.99 | +5.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 62.83 ± 0.02 | 85.69 ± 0.94 | 90.46 ± 1.04 | +5.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 68.49 ± 0.00 | 77.31 ± 0.79 | 81.04 ± 0.60 | +4.8% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 101.34 ± 0.04 | 128.84 ± 0.12 | 128.34 ± 1.83 | -0.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 113.33 ± 0.08 | 125.89 ± 0.23 | 126.20 ± 1.51 | +0.2% |

Intel A770

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| ------ | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 30.20 ± 0.43 | 44.36 ± 0.85 | +46.9% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 20.15 ± 0.01 | 27.51 ± 0.06 | +36.5% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 15.83 ± 0.03 | 15.88 ± 0.04 | +0.3% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 12.77 ± 0.02 | 12.79 ± 0.02 | +0.2% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 38.31 ± 0.05 | 46.81 ± 0.73 | +22.2% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 31.17 ± 0.12 | 36.67 ± 0.10 | +17.6% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 12.02 ± 0.01 | 14.82 ± 1.23 | +23.3% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 10.70 ± 0.00 | 12.08 ± 0.33 | +12.9% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 46.69 ± 3.24 | 48.46 ± 2.09 | +3.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 49.76 ± 0.08 | 51.98 ± 0.06 | +4.5% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 23.34 ± 0.02 | 38.08 ± 0.12 | +63.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 22.62 ± 0.06 | 36.32 ± 0.08 | +60.6% |

Nvidia RTX 3090

| model | size | params | ngl | fa | test | t/s (CUDA) | t/s (before) | t/s (after) | diff |
| ------ | ---: | ---: | --: | -: | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | tg128 | 137.34 ± 0.39 | 118.30 ± 0.45 | 117.37 ± 0.28 | -0.8% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | tg128 | 139.96 ± 0.52 | 120.61 ± 0.18 | 119.81 ± 0.30 | -0.7% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 0 | tg128 | 106.20 ± 0.37 | 101.26 ± 0.43 | 100.46 ± 0.36 | -0.8% |
| llama 8B Q3_K - Small | 3.41 GiB | 8.03 B | 99 | 1 | tg128 | 107.55 ± 0.21 | 102.67 ± 0.69 | 101.69 ± 0.70 | -1.0% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | tg128 | 143.67 ± 0.30 | 120.73 ± 4.47 | 121.30 ± 5.42 | +0.5% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 146.97 ± 0.18 | 125.54 ± 0.94 | 126.59 ± 2.29 | +0.8% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 0 | tg128 | 48.35 ± 0.01 | 40.43 ± 0.48 | 42.24 ± 0.13 | +4.5% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 48.82 ± 0.05 | 40.85 ± 0.14 | 42.32 ± 0.05 | +3.6% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 0 | tg128 | 141.25 ± 0.92 | 142.11 ± 12.00 | 145.57 ± 12.59 | +2.4% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 142.04 ± 0.34 | 149.71 ± 1.52 | 151.22 ± 0.42 | +1.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | tg128 | 155.74 ± 1.15 | 153.97 ± 16.85 | 153.30 ± 16.31 | -0.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 164.32 ± 3.05 | 167.49 ± 1.56 | 167.34 ± 1.11 | -0.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | tg128 | 197.83 ± 0.79 | 154.47 ± 12.95 | 163.89 ± 14.45 | +6.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 207.36 ± 0.49 | 164.30 ± 0.59 | 175.28 ± 1.24 | +6.7% |


0cc4m commented Nov 16, 2025

Something's broken in the nvidia-vulkan-cm and cm2 runs; I'll look into it.


0cc4m commented Nov 16, 2025

I can't reproduce the problem, even on my RTX 3090, with coopmat2, coopmat, or without coopmat. Not sure what is going on. It looks like incoherence, but for me the example runs just fine. @jeffbolznv any ideas?

@jeffbolznv

I pulled the branch but wasn't able to reproduce the failure. I don't have any great ideas - maybe some missing bounds checking?


0cc4m commented Nov 16, 2025

Bounds checking is much simpler with MMVQ, since all the inputs come in blocks of 256, 128, or 32 values. I didn't change how the output is stored, so I don't think that's the likely cause.
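
To illustrate the point: when every row is a whole number of blocks, a loop over blocks can never read past the end of a row, so no per-element bounds check is needed. A trivial sketch with illustrative names, not the actual shader code:

```cpp
#include <cstddef>

constexpr size_t BLOCK = 256; // 128 or 32 for the other quant formats

// row_len % BLOCK == 0 by construction, so every access below is in range;
// only the output write could go out of bounds, and that path is unchanged.
float sum_row(const float *row, size_t row_len) {
    float sum = 0.0f;
    for (size_t b = 0; b < row_len / BLOCK; ++b) {
        for (size_t i = 0; i < BLOCK; ++i) {
            sum += row[b * BLOCK + i];
        }
    }
    return sum;
}
```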

@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from 3c22e38 to e086733 on November 19, 2025 15:27

0cc4m commented Nov 19, 2025

I would like to merge this, but the CI keeps failing in a way I can't reproduce or understand. cmake-vulkan now failed with illegal instruction errors on llvmpipe. What is going on there?


Acly commented Nov 19, 2025

I had those Illegal (instruction) failures once in a PR, and they were related to a bad ccache. Maybe you can clear it and re-run that test.


0cc4m commented Nov 20, 2025

How do I clear it?


Acly commented Nov 20, 2025

I'm guessing: find the ccache entry related to this PR in https://github.com/ggml-org/llama.cpp/actions/caches and delete it. I don't have the required permissions; maybe you do. @slaren did it at the time.

@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from e086733 to e69d645 on November 22, 2025 09:57
@0cc4m 0cc4m marked this pull request as draft November 22, 2025 12:11
@0cc4m 0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-vec-k-quants branch from 9d0f9af to 9cbe4f8 on November 23, 2025 09:39
@github-actions github-actions bot added the testing (Everything test related) label on Nov 23, 2025

0cc4m commented Nov 25, 2025

So I've done about as much as I can think of to try to figure out what is going on with the llama-save-load-state test on the Nvidia Vulkan CI:

  • I can't reproduce it on my RTX 3090 setup.
  • I can't reproduce it on a RTX 2060 laptop setup, which is the same architecture as the T4.
  • I added backend tests that check what happens in the model, but they pass without triggering the issue.
  • The issue goes away if I disable integer dot mmvq, so this PR is clearly causing it, but not in a way that affects any other system.

@ggerganov Is anything about the CI Nvidia setup unusual or special? Maybe it's an issue with a specific driver?

@ggerganov

Taking a look now

@ggerganov

@0cc4m Do you reproduce using these steps:

# converted from https://huggingface.co/Qwen/Qwen3-0.6B-Base
wget https://huggingface.co/ggerganov/qwen-tmp/resolve/main/qwen3-0.6b-base-bf16.gguf

./bin/llama-quantize ./qwen3-0.6b-base-bf16.gguf ./qwen3-0.6b-base-q4_0.gguf q4_0

./bin/llama-save-load-state -m ./qwen3-0.6b-base-q4_0.gguf -ngl 10 -c 1024 -fa off


0cc4m commented Nov 25, 2025

@ggerganov I basically did that, yes, except that I downloaded https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/blob/main/Qwen3-0.6B-Q4_0.gguf instead of quantizing the base model myself. I just gave your steps a try, and it still passes on my hardware across many tries.

@ggerganov

The runners are also successful when using https://huggingface.co/ggml-org/Qwen3-0.6B-GGUF/blob/main/Qwen3-0.6B-Q4_0.gguf, but fail with the model that I provided. The difference is that in my model the input embeddings and the output tensor are merged, while in the ggml-org model they are separate.


0cc4m commented Nov 25, 2025

Thank you, that was it! It still doesn't happen on my RTX 3090, but now I was able to reproduce it on a laptop RTX 2060. I guess it's Turing-specific (@jeffbolznv).


0cc4m commented Nov 25, 2025

This is an extremely specific and fragile issue. It really only happens on Nvidia Turing, with this specific model and quant, with this specific number of offloaded layers (10), and with flash attention off. Even just enabling the Vulkan result checks (which run each operator individually and compare its results against the same operation on CPU) makes the issue go away.

The same happens if I disable operator fusion (GGML_VK_DISABLE_FUSION=1), if I disable the MMVQ method (which is what this PR changed), or even if I force-enable it (so that it also gets used in cases where previously it would not, for performance reasons).

You can see some kind of incoherence going on with ngl 9 and 10:

first run: The quick brown fox jumps over the lazy dog. 
编一个和homozyosyic

[...]

second run: The quick brown fox jumps over the lazy dog. 
OPEN A new +:( & wrap

If I just go to 8 or 11 offloaded layers, it is fixed:

first run: The quick brown fox jumps over the lazy dog. In this example, a topic is introduced, a

[...]

second run: The quick brown fox jumps over the lazy dog. In this example, a topic is introduced, a

If I disable integer dot (which takes out the changes in this PR), it runs correctly, but the output still looks incoherent:

first run: The quick brown fox jumps over the lazy dog. 
标出法。这是一句主二

[...]

second run: The quick brown fox jumps over the lazy dog. 
标出法。这是一句主二

@ggerganov Is it possible this is some kind of numerical model issue triggered by very small and specific changes in operator outputs? I don't see how this PR can be at fault if even using its changes more often (by increasing ngl or forcing more operators to use the MMVQ path) fixes it.


0cc4m commented Nov 25, 2025

I was finally able to reproduce it on AMD RX 8060S, but there it only happens with ngl 21, not 10.

It also goes away if I change the sampling seed in save-load-state.cpp or if I change the prompt to "The slow brown fox".

@ggerganov

The incoherent texts are most likely due to some numerical issues. I guess this can be expected.

The problem that I am worried about is that this test is needed to guarantee that restoring a previously saved state leads to the same generation. Somehow, this is not always the case here, which is very strange. The only explanation I can see is that the KV cache somehow gets slightly mutated during the store and restore. But I don't see how that could happen, since we do just byte-level copies of the buffers.

@jeffbolznv

I want to make sure I understand what this test is doing. Is the save/restore happening between tokens, or does it happen in the middle of generating a token? Is it using the get/set_tensor functions to do the save/restore?


0cc4m commented Nov 25, 2025

I think the part that is failing here is: processing the prompt, saving the state to a file, generating a number of tokens, restoring the state, generating the same number of tokens again, and then checking whether the two text results are identical.
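
A minimal sketch of that flow, assuming llama.cpp's llama_state_* API (the real test, examples/save-load-state, also round-trips the state through a file and a second context):

```cpp
#include <cstdint>
#include <string>
#include <vector>
#include "llama.h"

std::string generate_n(llama_context *ctx, int n); // assumed sampling helper

bool state_roundtrip_matches(llama_context *ctx, int n_tokens) {
    // snapshot the full context state (KV cache, RNG state, logits, ...)
    std::vector<uint8_t> state(llama_state_get_size(ctx));
    llama_state_get_data(ctx, state.data(), state.size());

    const std::string first = generate_n(ctx, n_tokens);

    // restore and generate again: with the same seed the text must match
    llama_state_set_data(ctx, state.data(), state.size());
    const std::string second = generate_n(ctx, n_tokens);

    return first == second;
}
```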


0cc4m commented Nov 25, 2025

I built a (WIP) tool in https://github.com/ggml-org/llama.cpp/tree/0cc4m/model-backend-compare to run CPU and device(s) side-by-side and compare the intermediate results. I didn't see anything concerning in the results for this model, though; the log follows below.
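
(I assume the nmse values below follow the usual definition, as in ggml's test-backend-ops: squared device-vs-CPU differences normalized by the energy of the CPU reference. A sketch of the metric:)

```cpp
#include <cstddef>

// Normalized mean squared error: sum((test - ref)^2) / sum(ref^2).
double nmse(const float *ref, const float *test, size_t n) {
    double err = 0.0, ref_energy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = double(test[i]) - double(ref[i]);
        err        += d * d;
        ref_energy += double(ref[i]) * double(ref[i]);
    }
    return ref_energy > 0.0 ? err / ref_energy : 0.0;
}
```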

Log
nmse = 0.000000000000 tensor 'inp_embd[token_embd.weight, leaf_2]' op = GET_ROWS
nmse = 0.000000000000 tensor 'norm-0[inp_embd]' op = RMS_NORM
nmse = 0.000000000000 tensor 'attn_norm-0[norm-0, blk.0.attn_norm.weight]' op = MUL
nmse = 0.000000128991 tensor 'Qcur-0[blk.0.attn_q.weight, attn_norm-0]' op = MUL_MAT
nmse = 0.000000382139 tensor 'Vcur-0[blk.0.attn_v.weight, attn_norm-0]' op = MUL_MAT
nmse = 0.000000092694 tensor 'Kcur-0[blk.0.attn_k.weight, attn_norm-0]' op = MUL_MAT
nmse = 0.000000244310 tensor 'norm-0[Qcur-0 (reshaped)]' op = RMS_NORM
nmse = 0.000000182836 tensor 'Qcur_normed-0[norm-0, blk.0.attn_q_norm.weight]' op = MUL
nmse = 0.000000182836 tensor 'Qcur-0[Qcur_normed-0, leaf_5]' op = ROPE
nmse = 0.000000119222 tensor 'norm-0[Kcur-0 (reshaped)]' op = RMS_NORM
nmse = 0.000000015708 tensor 'Kcur_normed-0[norm-0, blk.0.attn_k_norm.weight]' op = MUL
nmse = 0.000000015708 tensor 'Kcur-0[Kcur_normed-0, leaf_5]' op = ROPE
nmse = 0.000000074319 tensor 'cache_k_l0 (view)[Kcur-0 (view), leaf_9, cache_k_l0]' op = SET_ROWS
nmse = 0.000000456767 tensor 'cache_v_l0 (reshaped) (view)[Vcur-0 (reshaped) (reshaped), leaf_11, cache_v_l0 (reshaped)]' op = SET_ROWS
nmse = 0.000000028554 tensor 'kq-0[cache_k_l0 (view) (permuted), Qcur-0 (view) (permuted)]' op = MUL_MAT
nmse = 0.000000074950 tensor 'kq_soft_max-0[kq-0, leaf_13]' op = SOFT_MAX
nmse = 0.000000699935 tensor 'kqv-0[cache_v_l0 (view) (permuted), kq_soft_max-0]' op = MUL_MAT
nmse = 0.000000699935 tensor 'kqv_out-0[kqv-0 (permuted)]' op = CONT
nmse = 0.000009324064 tensor 'node_32[blk.0.attn_output.weight, kqv_out-0]' op = MUL_MAT
nmse = 0.000009195676 tensor 'ffn_inp-0[node_32, inp_embd]' op = ADD
nmse = 0.000009735010 tensor 'norm-0[ffn_inp-0]' op = RMS_NORM
nmse = 0.000011593351 tensor 'ffn_norm-0[norm-0, blk.0.ffn_norm.weight]' op = MUL
nmse = 0.000027872600 tensor 'ffn_gate-0[blk.0.ffn_gate.weight, ffn_norm-0]' op = MUL_MAT
nmse = 0.000063504477 tensor 'ffn_up-0[blk.0.ffn_up.weight, ffn_norm-0]' op = MUL_MAT
nmse = 0.000066113801 tensor 'ffn_swiglu-0[ffn_gate-0, ffn_up-0]' op = GLU
nmse = 0.000110117406 tensor 'ffn_out-0[blk.0.ffn_down.weight, ffn_swiglu-0]' op = MUL_MAT
nmse = 0.000037539915 tensor 'l_out-0[ffn_out-0, ffn_inp-0]' op = ADD
nmse = 0.000037132027 tensor 'norm-1[l_out-0]' op = RMS_NORM
nmse = 0.000042113173 tensor 'attn_norm-1[norm-1, blk.1.attn_norm.weight]' op = MUL
nmse = 0.000070427803 tensor 'Qcur-1[blk.1.attn_q.weight, attn_norm-1]' op = MUL_MAT
nmse = 0.000143869692 tensor 'Vcur-1[blk.1.attn_v.weight, attn_norm-1]' op = MUL_MAT
nmse = 0.000043203304 tensor 'Kcur-1[blk.1.attn_k.weight, attn_norm-1]' op = MUL_MAT
nmse = 0.000109360995 tensor 'norm-1[Qcur-1 (reshaped)]' op = RMS_NORM
nmse = 0.000099945595 tensor 'Qcur_normed-1[norm-1, blk.1.attn_q_norm.weight]' op = MUL
nmse = 0.000099945569 tensor 'Qcur-1[Qcur_normed-1, leaf_5]' op = ROPE
nmse = 0.000046160929 tensor 'norm-1[Kcur-1 (reshaped)]' op = RMS_NORM
nmse = 0.000003439169 tensor 'Kcur_normed-1[norm-1, blk.1.attn_k_norm.weight]' op = MUL
nmse = 0.000003439175 tensor 'Kcur-1[Kcur_normed-1, leaf_5]' op = ROPE
nmse = 0.000003694133 tensor 'cache_k_l1 (view)[Kcur-1 (view), leaf_9, cache_k_l1]' op = SET_ROWS
nmse = 0.000143940203 tensor 'cache_v_l1 (reshaped) (view)[Vcur-1 (reshaped) (reshaped), leaf_11, cache_v_l1 (reshaped)]' op = SET_ROWS
nmse = 0.000025982567 tensor 'kq-1[cache_k_l1 (view) (permuted), Qcur-1 (view) (permuted)]' op = MUL_MAT
nmse = 0.000044625212 tensor 'kq_soft_max-1[kq-1, leaf_13]' op = SOFT_MAX
nmse = 0.000249854241 tensor 'kqv-1[cache_v_l1 (view) (permuted), kq_soft_max-1]' op = MUL_MAT
nmse = 0.000249854241 tensor 'kqv_out-1[kqv-1 (permuted)]' op = CONT
nmse = 0.000301482523 tensor 'node_72[blk.1.attn_output.weight, kqv_out-1]' op = MUL_MAT
nmse = 0.000057836557 tensor 'ffn_inp-1[node_72, l_out-0]' op = ADD
nmse = 0.000057776828 tensor 'norm-1[ffn_inp-1]' op = RMS_NORM
nmse = 0.000045036137 tensor 'ffn_norm-1[norm-1, blk.1.ffn_norm.weight]' op = MUL
nmse = 0.000037830436 tensor 'ffn_gate-1[blk.1.ffn_gate.weight, ffn_norm-1]' op = MUL_MAT
nmse = 0.000161085016 tensor 'ffn_up-1[blk.1.ffn_up.weight, ffn_norm-1]' op = MUL_MAT
nmse = 0.000225061456 tensor 'ffn_swiglu-1[ffn_gate-1, ffn_up-1]' op = GLU
nmse = 0.000220768817 tensor 'ffn_out-1[blk.1.ffn_down.weight, ffn_swiglu-1]' op = MUL_MAT
nmse = 0.000093593501 tensor 'l_out-1[ffn_out-1, ffn_inp-1]' op = ADD
nmse = 0.000075989012 tensor 'norm-2[l_out-1]' op = RMS_NORM
nmse = 0.000057059104 tensor 'attn_norm-2[norm-2, blk.2.attn_norm.weight]' op = MUL
nmse = 0.000071723848 tensor 'Qcur-2[blk.2.attn_q.weight, attn_norm-2]' op = MUL_MAT
nmse = 0.000134525419 tensor 'Vcur-2[blk.2.attn_v.weight, attn_norm-2]' op = MUL_MAT
nmse = 0.000049303990 tensor 'Kcur-2[blk.2.attn_k.weight, attn_norm-2]' op = MUL_MAT
nmse = 0.000099870205 tensor 'norm-2[Qcur-2 (reshaped)]' op = RMS_NORM
nmse = 0.000115771151 tensor 'Qcur_normed-2[norm-2, blk.2.attn_q_norm.weight]' op = MUL
nmse = 0.000115771091 tensor 'Qcur-2[Qcur_normed-2, leaf_5]' op = ROPE
nmse = 0.000067215894 tensor 'norm-2[Kcur-2 (reshaped)]' op = RMS_NORM
nmse = 0.000012410045 tensor 'Kcur_normed-2[norm-2, blk.2.attn_k_norm.weight]' op = MUL
nmse = 0.000012410038 tensor 'Kcur-2[Kcur_normed-2, leaf_5]' op = ROPE
nmse = 0.000012639256 tensor 'cache_k_l2 (view)[Kcur-2 (view), leaf_9, cache_k_l2]' op = SET_ROWS
nmse = 0.000134562794 tensor 'cache_v_l2 (reshaped) (view)[Vcur-2 (reshaped) (reshaped), leaf_11, cache_v_l2 (reshaped)]' op = SET_ROWS
nmse = 0.000022854909 tensor 'kq-2[cache_k_l2 (view) (permuted), Qcur-2 (view) (permuted)]' op = MUL_MAT
nmse = 0.000060094775 tensor 'kq_soft_max-2[kq-2, leaf_13]' op = SOFT_MAX
nmse = 0.000156762264 tensor 'kqv-2[cache_v_l2 (view) (permuted), kq_soft_max-2]' op = MUL_MAT
nmse = 0.000156762264 tensor 'kqv_out-2[kqv-2 (permuted)]' op = CONT
nmse = 0.000108410189 tensor 'node_112[blk.2.attn_output.weight, kqv_out-2]' op = MUL_MAT
nmse = 0.000091925762 tensor 'ffn_inp-2[node_112, l_out-1]' op = ADD
nmse = 0.000091102040 tensor 'norm-2[ffn_inp-2]' op = RMS_NORM
nmse = 0.000062452055 tensor 'ffn_norm-2[norm-2, blk.2.ffn_norm.weight]' op = MUL
nmse = 0.000065326061 tensor 'ffn_gate-2[blk.2.ffn_gate.weight, ffn_norm-2]' op = MUL_MAT
nmse = 0.000056519733 tensor 'ffn_up-2[blk.2.ffn_up.weight, ffn_norm-2]' op = MUL_MAT
nmse = 0.000000201982 tensor 'ffn_swiglu-2[ffn_gate-2, ffn_up-2]' op = GLU
nmse = 0.000000182218 tensor 'ffn_out-2[blk.2.ffn_down.weight, ffn_swiglu-2]' op = MUL_MAT
nmse = 0.000000185309 tensor 'l_out-2[ffn_out-2, ffn_inp-2]' op = ADD
nmse = 0.000099881032 tensor 'norm-3[l_out-2]' op = RMS_NORM
nmse = 0.000151292186 tensor 'attn_norm-3[norm-3, blk.3.attn_norm.weight]' op = MUL
nmse = 0.000109464286 tensor 'Qcur-3[blk.3.attn_q.weight, attn_norm-3]' op = MUL_MAT
nmse = 0.000335525038 tensor 'Vcur-3[blk.3.attn_v.weight, attn_norm-3]' op = MUL_MAT
nmse = 0.000142978812 tensor 'Kcur-3[blk.3.attn_k.weight, attn_norm-3]' op = MUL_MAT
nmse = 0.000120847163 tensor 'norm-3[Qcur-3 (reshaped)]' op = RMS_NORM
nmse = 0.000113708516 tensor 'Qcur_normed-3[norm-3, blk.3.attn_q_norm.weight]' op = MUL
nmse = 0.000113708475 tensor 'Qcur-3[Qcur_normed-3, leaf_5]' op = ROPE
nmse = 0.000119749926 tensor 'norm-3[Kcur-3 (reshaped)]' op = RMS_NORM
nmse = 0.000028574110 tensor 'Kcur_normed-3[norm-3, blk.3.attn_k_norm.weight]' op = MUL
nmse = 0.000028574103 tensor 'Kcur-3[Kcur_normed-3, leaf_5]' op = ROPE
nmse = 0.000028930934 tensor 'cache_k_l3 (view)[Kcur-3 (view), leaf_9, cache_k_l3]' op = SET_ROWS
nmse = 0.000335313873 tensor 'cache_v_l3 (reshaped) (view)[Vcur-3 (reshaped) (reshaped), leaf_11, cache_v_l3 (reshaped)]' op = SET_ROWS
nmse = 0.000021274552 tensor 'kq-3[cache_k_l3 (view) (permuted), Qcur-3 (view) (permuted)]' op = MUL_MAT
nmse = 0.000027780559 tensor 'kq_soft_max-3[kq-3, leaf_13]' op = SOFT_MAX
nmse = 0.000848972839 tensor 'kqv-3[cache_v_l3 (view) (permuted), kq_soft_max-3]' op = MUL_MAT
nmse = 0.000848972839 tensor 'kqv_out-3[kqv-3 (permuted)]' op = CONT
nmse = 0.000782136909 tensor 'node_152[blk.3.attn_output.weight, kqv_out-3]' op = MUL_MAT
nmse = 0.000000186195 tensor 'ffn_inp-3[node_152, l_out-2]' op = ADD
nmse = 0.000120568550 tensor 'norm-3[ffn_inp-3]' op = RMS_NORM
nmse = 0.000155761742 tensor 'ffn_norm-3[norm-3, blk.3.ffn_norm.weight]' op = MUL
nmse = 0.000091953142 tensor 'ffn_gate-3[blk.3.ffn_gate.weight, ffn_norm-3]' op = MUL_MAT
nmse = 0.000333618663 tensor 'ffn_up-3[blk.3.ffn_up.weight, ffn_norm-3]' op = MUL_MAT
nmse = 0.000587693851 tensor 'ffn_swiglu-3[ffn_gate-3, ffn_up-3]' op = GLU
nmse = 0.000651554241 tensor 'ffn_out-3[blk.3.ffn_down.weight, ffn_swiglu-3]' op = MUL_MAT
nmse = 0.000000188103 tensor 'l_out-3[ffn_out-3, ffn_inp-3]' op = ADD
nmse = 0.000147085159 tensor 'norm-4[l_out-3]' op = RMS_NORM
nmse = 0.000190420547 tensor 'attn_norm-4[norm-4, blk.4.attn_norm.weight]' op = MUL
nmse = 0.000130909038 tensor 'Qcur-4[blk.4.attn_q.weight, attn_norm-4]' op = MUL_MAT
nmse = 0.000426061451 tensor 'Vcur-4[blk.4.attn_v.weight, attn_norm-4]' op = MUL_MAT
nmse = 0.000134338635 tensor 'Kcur-4[blk.4.attn_k.weight, attn_norm-4]' op = MUL_MAT
nmse = 0.000133603012 tensor 'norm-4[Qcur-4 (reshaped)]' op = RMS_NORM
nmse = 0.000086131649 tensor 'Qcur_normed-4[norm-4, blk.4.attn_q_norm.weight]' op = MUL
nmse = 0.000086131662 tensor 'Qcur-4[Qcur_normed-4, leaf_5]' op = ROPE
nmse = 0.000135923544 tensor 'norm-4[Kcur-4 (reshaped)]' op = RMS_NORM
nmse = 0.000043817848 tensor 'Kcur_normed-4[norm-4, blk.4.attn_k_norm.weight]' op = MUL
nmse = 0.000043817822 tensor 'Kcur-4[Kcur_normed-4, leaf_5]' op = ROPE
nmse = 0.000044025544 tensor 'cache_k_l4 (view)[Kcur-4 (view), leaf_9, cache_k_l4]' op = SET_ROWS
nmse = 0.000426048487 tensor 'cache_v_l4 (reshaped) (view)[Vcur-4 (reshaped) (reshaped), leaf_11, cache_v_l4 (reshaped)]' op = SET_ROWS
nmse = 0.000023153820 tensor 'kq-4[cache_k_l4 (view) (permuted), Qcur-4 (view) (permuted)]' op = MUL_MAT
nmse = 0.000021681903 tensor 'kq_soft_max-4[kq-4, leaf_13]' op = SOFT_MAX
nmse = 0.000531979945 tensor 'kqv-4[cache_v_l4 (view) (permuted), kq_soft_max-4]' op = MUL_MAT
nmse = 0.000531979945 tensor 'kqv_out-4[kqv-4 (permuted)]' op = CONT
nmse = 0.000741086817 tensor 'node_192[blk.4.attn_output.weight, kqv_out-4]' op = MUL_MAT
nmse = 0.000000187727 tensor 'ffn_inp-4[node_192, l_out-3]' op = ADD
nmse = 0.000154714955 tensor 'norm-4[ffn_inp-4]' op = RMS_NORM
nmse = 0.000245659497 tensor 'ffn_norm-4[norm-4, blk.4.ffn_norm.weight]' op = MUL
nmse = 0.000105132964 tensor 'ffn_gate-4[blk.4.ffn_gate.weight, ffn_norm-4]' op = MUL_MAT
nmse = 0.000410043585 tensor 'ffn_up-4[blk.4.ffn_up.weight, ffn_norm-4]' op = MUL_MAT
nmse = 0.000576161318 tensor 'ffn_swiglu-4[ffn_gate-4, ffn_up-4]' op = GLU
nmse = 0.000812984153 tensor 'ffn_out-4[blk.4.ffn_down.weight, ffn_swiglu-4]' op = MUL_MAT
nmse = 0.000000190269 tensor 'l_out-4[ffn_out-4, ffn_inp-4]' op = ADD
nmse = 0.000173285963 tensor 'norm-5[l_out-4]' op = RMS_NORM
nmse = 0.000210274297 tensor 'attn_norm-5[norm-5, blk.5.attn_norm.weight]' op = MUL
nmse = 0.000206002990 tensor 'Qcur-5[blk.5.attn_q.weight, attn_norm-5]' op = MUL_MAT
nmse = 0.000406373520 tensor 'Vcur-5[blk.5.attn_v.weight, attn_norm-5]' op = MUL_MAT
nmse = 0.000182009335 tensor 'Kcur-5[blk.5.attn_k.weight, attn_norm-5]' op = MUL_MAT
nmse = 0.000214694073 tensor 'norm-5[Qcur-5 (reshaped)]' op = RMS_NORM
nmse = 0.000125262668 tensor 'Qcur_normed-5[norm-5, blk.5.attn_q_norm.weight]' op = MUL
nmse = 0.000125262608 tensor 'Qcur-5[Qcur_normed-5, leaf_5]' op = ROPE
nmse = 0.000170143720 tensor 'norm-5[Kcur-5 (reshaped)]' op = RMS_NORM
nmse = 0.000033016977 tensor 'Kcur_normed-5[norm-5, blk.5.attn_k_norm.weight]' op = MUL
nmse = 0.000033016965 tensor 'Kcur-5[Kcur_normed-5, leaf_5]' op = ROPE
nmse = 0.000032945948 tensor 'cache_k_l5 (view)[Kcur-5 (view), leaf_9, cache_k_l5]' op = SET_ROWS
nmse = 0.000406271398 tensor 'cache_v_l5 (reshaped) (view)[Vcur-5 (reshaped) (reshaped), leaf_11, cache_v_l5 (reshaped)]' op = SET_ROWS
nmse = 0.000060111344 tensor 'kq-5[cache_k_l5 (view) (permuted), Qcur-5 (view) (permuted)]' op = MUL_MAT
nmse = 0.000048342017 tensor 'kq_soft_max-5[kq-5, leaf_13]' op = SOFT_MAX
nmse = 0.000592121222 tensor 'kqv-5[cache_v_l5 (view) (permuted), kq_soft_max-5]' op = MUL_MAT
nmse = 0.000592121222 tensor 'kqv_out-5[kqv-5 (permuted)]' op = CONT
nmse = 0.000546098660 tensor 'node_232[blk.5.attn_output.weight, kqv_out-5]' op = MUL_MAT
nmse = 0.000000193609 tensor 'ffn_inp-5[node_232, l_out-4]' op = ADD
nmse = 0.000192494029 tensor 'norm-5[ffn_inp-5]' op = RMS_NORM
nmse = 0.000338447995 tensor 'ffn_norm-5[norm-5, blk.5.ffn_norm.weight]' op = MUL
nmse = 0.000231709412 tensor 'ffn_gate-5[blk.5.ffn_gate.weight, ffn_norm-5]' op = MUL_MAT
nmse = 0.000478689195 tensor 'ffn_up-5[blk.5.ffn_up.weight, ffn_norm-5]' op = MUL_MAT
nmse = 0.000669565941 tensor 'ffn_swiglu-5[ffn_gate-5, ffn_up-5]' op = GLU
nmse = 0.000809328957 tensor 'ffn_out-5[blk.5.ffn_down.weight, ffn_swiglu-5]' op = MUL_MAT
nmse = 0.000000199025 tensor 'l_out-5[ffn_out-5, ffn_inp-5]' op = ADD
nmse = 0.000227379563 tensor 'norm-6[l_out-5]' op = RMS_NORM
nmse = 0.000303152141 tensor 'attn_norm-6[norm-6, blk.6.attn_norm.weight]' op = MUL
nmse = 0.000145704804 tensor 'Qcur-6[blk.6.attn_q.weight, attn_norm-6]' op = MUL_MAT
nmse = 0.000498639136 tensor 'Vcur-6[blk.6.attn_v.weight, attn_norm-6]' op = MUL_MAT
nmse = 0.000226265425 tensor 'Kcur-6[blk.6.attn_k.weight, attn_norm-6]' op = MUL_MAT
nmse = 0.000147657941 tensor 'norm-6[Qcur-6 (reshaped)]' op = RMS_NORM
nmse = 0.000147898411 tensor 'Qcur_normed-6[norm-6, blk.6.attn_q_norm.weight]' op = MUL
nmse = 0.000147898407 tensor 'Qcur-6[Qcur_normed-6, leaf_5]' op = ROPE
nmse = 0.000175242189 tensor 'norm-6[Kcur-6 (reshaped)]' op = RMS_NORM
nmse = 0.000063842992 tensor 'Kcur_normed-6[norm-6, blk.6.attn_k_norm.weight]' op = MUL
nmse = 0.000063842969 tensor 'Kcur-6[Kcur_normed-6, leaf_5]' op = ROPE
nmse = 0.000064055392 tensor 'cache_k_l6 (view)[Kcur-6 (view), leaf_9, cache_k_l6]' op = SET_ROWS
nmse = 0.000499063605 tensor 'cache_v_l6 (reshaped) (view)[Vcur-6 (reshaped) (reshaped), leaf_11, cache_v_l6 (reshaped)]' op = SET_ROWS
nmse = 0.000016956803 tensor 'kq-6[cache_k_l6 (view) (permuted), Qcur-6 (view) (permuted)]' op = MUL_MAT
nmse = 0.000058555320 tensor 'kq_soft_max-6[kq-6, leaf_13]' op = SOFT_MAX
nmse = 0.000674833930 tensor 'kqv-6[cache_v_l6 (view) (permuted), kq_soft_max-6]' op = MUL_MAT
nmse = 0.000674833930 tensor 'kqv_out-6[kqv-6 (permuted)]' op = CONT
nmse = 0.000706835016 tensor 'node_272[blk.6.attn_output.weight, kqv_out-6]' op = MUL_MAT
nmse = 0.000000201665 tensor 'ffn_inp-6[node_272, l_out-5]' op = ADD
nmse = 0.000260261499 tensor 'norm-6[ffn_inp-6]' op = RMS_NORM
nmse = 0.000440621681 tensor 'ffn_norm-6[norm-6, blk.6.ffn_norm.weight]' op = MUL
nmse = 0.000248814575 tensor 'ffn_gate-6[blk.6.ffn_gate.weight, ffn_norm-6]' op = MUL_MAT
nmse = 0.000556010159 tensor 'ffn_up-6[blk.6.ffn_up.weight, ffn_norm-6]' op = MUL_MAT
nmse = 0.000953982168 tensor 'ffn_swiglu-6[ffn_gate-6, ffn_up-6]' op = GLU
nmse = 0.001042571479 tensor 'ffn_out-6[blk.6.ffn_down.weight, ffn_swiglu-6]' op = MUL_MAT
nmse = 0.000000210356 tensor 'l_out-6[ffn_out-6, ffn_inp-6]' op = ADD
nmse = 0.000315160836 tensor 'norm-7[l_out-6]' op = RMS_NORM
nmse = 0.000429744652 tensor 'attn_norm-7[norm-7, blk.7.attn_norm.weight]' op = MUL
nmse = 0.000292021135 tensor 'Qcur-7[blk.7.attn_q.weight, attn_norm-7]' op = MUL_MAT
nmse = 0.000731672738 tensor 'Vcur-7[blk.7.attn_v.weight, attn_norm-7]' op = MUL_MAT
nmse = 0.000324759743 tensor 'Kcur-7[blk.7.attn_k.weight, attn_norm-7]' op = MUL_MAT
nmse = 0.000252328438 tensor 'norm-7[Qcur-7 (reshaped)]' op = RMS_NORM
nmse = 0.000247171732 tensor 'Qcur_normed-7[norm-7, blk.7.attn_q_norm.weight]' op = MUL
nmse = 0.000247171638 tensor 'Qcur-7[Qcur_normed-7, leaf_5]' op = ROPE
nmse = 0.000269078388 tensor 'norm-7[Kcur-7 (reshaped)]' op = RMS_NORM
nmse = 0.000068388179 tensor 'Kcur_normed-7[norm-7, blk.7.attn_k_norm.weight]' op = MUL
nmse = 0.000068388161 tensor 'Kcur-7[Kcur_normed-7, leaf_5]' op = ROPE
nmse = 0.000068979210 tensor 'cache_k_l7 (view)[Kcur-7 (view), leaf_9, cache_k_l7]' op = SET_ROWS
nmse = 0.000732216648 tensor 'cache_v_l7 (reshaped) (view)[Vcur-7 (reshaped) (reshaped), leaf_11, cache_v_l7 (reshaped)]' op = SET_ROWS
nmse = 0.000075249944 tensor 'kq-7[cache_k_l7 (view) (permuted), Qcur-7 (view) (permuted)]' op = MUL_MAT
nmse = 0.000059560465 tensor 'kq_soft_max-7[kq-7, leaf_13]' op = SOFT_MAX
nmse = 0.000826361866 tensor 'kqv-7[cache_v_l7 (view) (permuted), kq_soft_max-7]' op = MUL_MAT
nmse = 0.000826361866 tensor 'kqv_out-7[kqv-7 (permuted)]' op = CONT
nmse = 0.001023804778 tensor 'node_312[blk.7.attn_output.weight, kqv_out-7]' op = MUL_MAT
nmse = 0.000000210647 tensor 'ffn_inp-7[node_312, l_out-6]' op = ADD
nmse = 0.000330856377 tensor 'norm-7[ffn_inp-7]' op = RMS_NORM
nmse = 0.000551333982 tensor 'ffn_norm-7[norm-7, blk.7.ffn_norm.weight]' op = MUL
nmse = 0.000295665275 tensor 'ffn_gate-7[blk.7.ffn_gate.weight, ffn_norm-7]' op = MUL_MAT
nmse = 0.000668858220 tensor 'ffn_up-7[blk.7.ffn_up.weight, ffn_norm-7]' op = MUL_MAT
nmse = 0.001398091080 tensor 'ffn_swiglu-7[ffn_gate-7, ffn_up-7]' op = GLU
nmse = 0.001438168164 tensor 'ffn_out-7[blk.7.ffn_down.weight, ffn_swiglu-7]' op = MUL_MAT
nmse = 0.000000218369 tensor 'l_out-7[ffn_out-7, ffn_inp-7]' op = ADD
nmse = 0.000366880543 tensor 'norm-8[l_out-7]' op = RMS_NORM
nmse = 0.000525020464 tensor 'attn_norm-8[norm-8, blk.8.attn_norm.weight]' op = MUL
nmse = 0.000304971683 tensor 'Qcur-8[blk.8.attn_q.weight, attn_norm-8]' op = MUL_MAT
nmse = 0.000709922014 tensor 'Vcur-8[blk.8.attn_v.weight, attn_norm-8]' op = MUL_MAT
nmse = 0.000405111681 tensor 'Kcur-8[blk.8.attn_k.weight, attn_norm-8]' op = MUL_MAT
nmse = 0.000269048098 tensor 'norm-8[Qcur-8 (reshaped)]' op = RMS_NORM
nmse = 0.000235421512 tensor 'Qcur_normed-8[norm-8, blk.8.attn_q_norm.weight]' op = MUL
nmse = 0.000235421528 tensor 'Qcur-8[Qcur_normed-8, leaf_5]' op = ROPE
nmse = 0.000318890134 tensor 'norm-8[Kcur-8 (reshaped)]' op = RMS_NORM
nmse = 0.000176054431 tensor 'Kcur_normed-8[norm-8, blk.8.attn_k_norm.weight]' op = MUL
nmse = 0.000176054453 tensor 'Kcur-8[Kcur_normed-8, leaf_5]' op = ROPE
nmse = 0.000176267818 tensor 'cache_k_l8 (view)[Kcur-8 (view), leaf_9, cache_k_l8]' op = SET_ROWS
nmse = 0.000709867457 tensor 'cache_v_l8 (reshaped) (view)[Vcur-8 (reshaped) (reshaped), leaf_11, cache_v_l8 (reshaped)]' op = SET_ROWS
nmse = 0.000037468984 tensor 'kq-8[cache_k_l8 (view) (permuted), Qcur-8 (view) (permuted)]' op = MUL_MAT
nmse = 0.000008443300 tensor 'kq_soft_max-8[kq-8, leaf_13]' op = SOFT_MAX
nmse = 0.000442654962 tensor 'kqv-8[cache_v_l8 (view) (permuted), kq_soft_max-8]' op = MUL_MAT
nmse = 0.000442654962 tensor 'kqv_out-8[kqv-8 (permuted)]' op = CONT
nmse = 0.000572985241 tensor 'node_352[blk.8.attn_output.weight, kqv_out-8]' op = MUL_MAT
nmse = 0.000000217921 tensor 'ffn_inp-8[node_352, l_out-7]' op = ADD
nmse = 0.000380450307 tensor 'norm-8[ffn_inp-8]' op = RMS_NORM
nmse = 0.000658084859 tensor 'ffn_norm-8[norm-8, blk.8.ffn_norm.weight]' op = MUL
nmse = 0.000352107649 tensor 'ffn_gate-8[blk.8.ffn_gate.weight, ffn_norm-8]' op = MUL_MAT
nmse = 0.000798612741 tensor 'ffn_up-8[blk.8.ffn_up.weight, ffn_norm-8]' op = MUL_MAT
nmse = 0.001451662056 tensor 'ffn_swiglu-8[ffn_gate-8, ffn_up-8]' op = GLU
nmse = 0.001385109121 tensor 'ffn_out-8[blk.8.ffn_down.weight, ffn_swiglu-8]' op = MUL_MAT
nmse = 0.000000228850 tensor 'l_out-8[ffn_out-8, ffn_inp-8]' op = ADD
nmse = 0.000395944565 tensor 'norm-9[l_out-8]' op = RMS_NORM
nmse = 0.000480961359 tensor 'attn_norm-9[norm-9, blk.9.attn_norm.weight]' op = MUL
nmse = 0.000301327996 tensor 'Qcur-9[blk.9.attn_q.weight, attn_norm-9]' op = MUL_MAT
nmse = 0.000933122848 tensor 'Vcur-9[blk.9.attn_v.weight, attn_norm-9]' op = MUL_MAT
nmse = 0.000419413444 tensor 'Kcur-9[blk.9.attn_k.weight, attn_norm-9]' op = MUL_MAT
nmse = 0.000340900723 tensor 'norm-9[Qcur-9 (reshaped)]' op = RMS_NORM
nmse = 0.000354967940 tensor 'Qcur_normed-9[norm-9, blk.9.attn_q_norm.weight]' op = MUL
nmse = 0.000354968032 tensor 'Qcur-9[Qcur_normed-9, leaf_5]' op = ROPE
nmse = 0.000348913343 tensor 'norm-9[Kcur-9 (reshaped)]' op = RMS_NORM
nmse = 0.000144724404 tensor 'Kcur_normed-9[norm-9, blk.9.attn_k_norm.weight]' op = MUL
nmse = 0.000144724322 tensor 'Kcur-9[Kcur_normed-9, leaf_5]' op = ROPE
nmse = 0.000144362604 tensor 'cache_k_l9 (view)[Kcur-9 (view), leaf_9, cache_k_l9]' op = SET_ROWS
nmse = 0.000933021849 tensor 'cache_v_l9 (reshaped) (view)[Vcur-9 (reshaped) (reshaped), leaf_11, cache_v_l9 (reshaped)]' op = SET_ROWS
nmse = 0.000059781942 tensor 'kq-9[cache_k_l9 (view) (permuted), Qcur-9 (view) (permuted)]' op = MUL_MAT
nmse = 0.000075209082 tensor 'kq_soft_max-9[kq-9, leaf_13]' op = SOFT_MAX
nmse = 0.001040650220 tensor 'kqv-9[cache_v_l9 (view) (permuted), kq_soft_max-9]' op = MUL_MAT
nmse = 0.001040650220 tensor 'kqv_out-9[kqv-9 (permuted)]' op = CONT
nmse = 0.001162761417 tensor 'node_392[blk.9.attn_output.weight, kqv_out-9]' op = MUL_MAT
nmse = 0.000000230168 tensor 'ffn_inp-9[node_392, l_out-8]' op = ADD
nmse = 0.000407619681 tensor 'norm-9[ffn_inp-9]' op = RMS_NORM
nmse = 0.000692207850 tensor 'ffn_norm-9[norm-9, blk.9.ffn_norm.weight]' op = MUL
nmse = 0.000417572119 tensor 'ffn_gate-9[blk.9.ffn_gate.weight, ffn_norm-9]' op = MUL_MAT
nmse = 0.000887384016 tensor 'ffn_up-9[blk.9.ffn_up.weight, ffn_norm-9]' op = MUL_MAT
nmse = 0.001562205271 tensor 'ffn_swiglu-9[ffn_gate-9, ffn_up-9]' op = GLU
nmse = 0.001490651634 tensor 'ffn_out-9[blk.9.ffn_down.weight, ffn_swiglu-9]' op = MUL_MAT
nmse = 0.000000246082 tensor 'l_out-9[ffn_out-9, ffn_inp-9]' op = ADD
nmse = 0.000412616649 tensor 'norm-10[l_out-9]' op = RMS_NORM
nmse = 0.000478225341 tensor 'attn_norm-10[norm-10, blk.10.attn_norm.weight]' op = MUL
nmse = 0.000358536455 tensor 'Qcur-10[blk.10.attn_q.weight, attn_norm-10]' op = MUL_MAT
nmse = 0.000948984074 tensor 'Vcur-10[blk.10.attn_v.weight, attn_norm-10]' op = MUL_MAT
nmse = 0.000491864308 tensor 'Kcur-10[blk.10.attn_k.weight, attn_norm-10]' op = MUL_MAT
nmse = 0.000307570018 tensor 'norm-10[Qcur-10 (reshaped)]' op = RMS_NORM
nmse = 0.000309583723 tensor 'Qcur_normed-10[norm-10, blk.10.attn_q_norm.weight]' op = MUL
nmse = 0.000309583678 tensor 'Qcur-10[Qcur_normed-10, leaf_5]' op = ROPE
nmse = 0.000418917686 tensor 'norm-10[Kcur-10 (reshaped)]' op = RMS_NORM
nmse = 0.000119249474 tensor 'Kcur_normed-10[norm-10, blk.10.attn_k_norm.weight]' op = MUL
nmse = 0.000119249564 tensor 'Kcur-10[Kcur_normed-10, leaf_5]' op = ROPE
nmse = 0.000120008319 tensor 'cache_k_l10 (view)[Kcur-10 (view), leaf_9, cache_k_l10]' op = SET_ROWS
nmse = 0.000948608979 tensor 'cache_v_l10 (reshaped) (view)[Vcur-10 (reshaped) (reshaped), leaf_11, cache_v_l10 (reshaped)]' op = SET_ROWS
nmse = 0.000145655479 tensor 'kq-10[cache_k_l10 (view) (permuted), Qcur-10 (view) (permuted)]' op = MUL_MAT
nmse = 0.000057558315 tensor 'kq_soft_max-10[kq-10, leaf_13]' op = SOFT_MAX
nmse = 0.000681128405 tensor 'kqv-10[cache_v_l10 (view) (permuted), kq_soft_max-10]' op = MUL_MAT
nmse = 0.000681128405 tensor 'kqv_out-10[kqv-10 (permuted)]' op = CONT
nmse = 0.000839750498 tensor 'node_432[blk.10.attn_output.weight, kqv_out-10]' op = MUL_MAT
nmse = 0.000000248519 tensor 'ffn_inp-10[node_432, l_out-9]' op = ADD
nmse = 0.000411245309 tensor 'norm-10[ffn_inp-10]' op = RMS_NORM
nmse = 0.000657942673 tensor 'ffn_norm-10[norm-10, blk.10.ffn_norm.weight]' op = MUL
nmse = 0.000427536406 tensor 'ffn_gate-10[blk.10.ffn_gate.weight, ffn_norm-10]' op = MUL_MAT
nmse = 0.000729467851 tensor 'ffn_up-10[blk.10.ffn_up.weight, ffn_norm-10]' op = MUL_MAT
nmse = 0.001687893020 tensor 'ffn_swiglu-10[ffn_gate-10, ffn_up-10]' op = GLU
nmse = 0.001678220926 tensor 'ffn_out-10[blk.10.ffn_down.weight, ffn_swiglu-10]' op = MUL_MAT
nmse = 0.000000272172 tensor 'l_out-10[ffn_out-10, ffn_inp-10]' op = ADD
nmse = 0.000458786804 tensor 'norm-11[l_out-10]' op = RMS_NORM
nmse = 0.000405567017 tensor 'attn_norm-11[norm-11, blk.11.attn_norm.weight]' op = MUL
nmse = 0.000203121704 tensor 'Qcur-11[blk.11.attn_q.weight, attn_norm-11]' op = MUL_MAT
nmse = 0.000900053921 tensor 'Vcur-11[blk.11.attn_v.weight, attn_norm-11]' op = MUL_MAT
nmse = 0.000396874648 tensor 'Kcur-11[blk.11.attn_k.weight, attn_norm-11]' op = MUL_MAT
nmse = 0.000178327389 tensor 'norm-11[Qcur-11 (reshaped)]' op = RMS_NORM
nmse = 0.000289913503 tensor 'Qcur_normed-11[norm-11, blk.11.attn_q_norm.weight]' op = MUL
nmse = 0.000289913533 tensor 'Qcur-11[Qcur_normed-11, leaf_5]' op = ROPE
nmse = 0.000356537536 tensor 'norm-11[Kcur-11 (reshaped)]' op = RMS_NORM
nmse = 0.000237955774 tensor 'Kcur_normed-11[norm-11, blk.11.attn_k_norm.weight]' op = MUL
nmse = 0.000237955810 tensor 'Kcur-11[Kcur_normed-11, leaf_5]' op = ROPE
nmse = 0.000237923676 tensor 'cache_k_l11 (view)[Kcur-11 (view), leaf_9, cache_k_l11]' op = SET_ROWS
nmse = 0.000901014747 tensor 'cache_v_l11 (reshaped) (view)[Vcur-11 (reshaped) (reshaped), leaf_11, cache_v_l11 (reshaped)]' op = SET_ROWS
nmse = 0.000045405057 tensor 'kq-11[cache_k_l11 (view) (permuted), Qcur-11 (view) (permuted)]' op = MUL_MAT
nmse = 0.000067134017 tensor 'kq_soft_max-11[kq-11, leaf_13]' op = SOFT_MAX
nmse = 0.000822824236 tensor 'kqv-11[cache_v_l11 (view) (permuted), kq_soft_max-11]' op = MUL_MAT
nmse = 0.000822824236 tensor 'kqv_out-11[kqv-11 (permuted)]' op = CONT
nmse = 0.000737509238 tensor 'node_472[blk.11.attn_output.weight, kqv_out-11]' op = MUL_MAT
nmse = 0.000000281431 tensor 'ffn_inp-11[node_472, l_out-10]' op = ADD
nmse = 0.000398139165 tensor 'norm-11[ffn_inp-11]' op = RMS_NORM
nmse = 0.000818705805 tensor 'ffn_norm-11[norm-11, blk.11.ffn_norm.weight]' op = MUL
nmse = 0.000537882983 tensor 'ffn_gate-11[blk.11.ffn_gate.weight, ffn_norm-11]' op = MUL_MAT
nmse = 0.001015450797 tensor 'ffn_up-11[blk.11.ffn_up.weight, ffn_norm-11]' op = MUL_MAT
nmse = 0.001801906913 tensor 'ffn_swiglu-11[ffn_gate-11, ffn_up-11]' op = GLU
nmse = 0.001701320529 tensor 'ffn_out-11[blk.11.ffn_down.weight, ffn_swiglu-11]' op = MUL_MAT
nmse = 0.000000290626 tensor 'l_out-11[ffn_out-11, ffn_inp-11]' op = ADD
nmse = 0.000376443977 tensor 'norm-12[l_out-11]' op = RMS_NORM
nmse = 0.000560231754 tensor 'attn_norm-12[norm-12, blk.12.attn_norm.weight]' op = MUL
nmse = 0.000311194845 tensor 'Qcur-12[blk.12.attn_q.weight, attn_norm-12]' op = MUL_MAT
nmse = 0.001269880821 tensor 'Vcur-12[blk.12.attn_v.weight, attn_norm-12]' op = MUL_MAT
nmse = 0.000645337898 tensor 'Kcur-12[blk.12.attn_k.weight, attn_norm-12]' op = MUL_MAT
nmse = 0.000259093111 tensor 'norm-12[Qcur-12 (reshaped)]' op = RMS_NORM
nmse = 0.000371505964 tensor 'Qcur_normed-12[norm-12, blk.12.attn_q_norm.weight]' op = MUL
nmse = 0.000371505972 tensor 'Qcur-12[Qcur_normed-12, leaf_5]' op = ROPE
nmse = 0.000575721486 tensor 'norm-12[Kcur-12 (reshaped)]' op = RMS_NORM
nmse = 0.000152631088 tensor 'Kcur_normed-12[norm-12, blk.12.attn_k_norm.weight]' op = MUL
nmse = 0.000152631032 tensor 'Kcur-12[Kcur_normed-12, leaf_5]' op = ROPE
nmse = 0.000152427402 tensor 'cache_k_l12 (view)[Kcur-12 (view), leaf_9, cache_k_l12]' op = SET_ROWS
nmse = 0.001269882223 tensor 'cache_v_l12 (reshaped) (view)[Vcur-12 (reshaped) (reshaped), leaf_11, cache_v_l12 (reshaped)]' op = SET_ROWS
nmse = 0.000082635135 tensor 'kq-12[cache_k_l12 (view) (permuted), Qcur-12 (view) (permuted)]' op = MUL_MAT
nmse = 0.000033394107 tensor 'kq_soft_max-12[kq-12, leaf_13]' op = SOFT_MAX
nmse = 0.000859604872 tensor 'kqv-12[cache_v_l12 (view) (permuted), kq_soft_max-12]' op = MUL_MAT
nmse = 0.000859604872 tensor 'kqv_out-12[kqv-12 (permuted)]' op = CONT
nmse = 0.001008879105 tensor 'node_512[blk.12.attn_output.weight, kqv_out-12]' op = MUL_MAT
nmse = 0.000000291548 tensor 'ffn_inp-12[node_512, l_out-11]' op = ADD
nmse = 0.000394604483 tensor 'norm-12[ffn_inp-12]' op = RMS_NORM
nmse = 0.000868706034 tensor 'ffn_norm-12[norm-12, blk.12.ffn_norm.weight]' op = MUL
nmse = 0.000582391144 tensor 'ffn_gate-12[blk.12.ffn_gate.weight, ffn_norm-12]' op = MUL_MAT
nmse = 0.001100063330 tensor 'ffn_up-12[blk.12.ffn_up.weight, ffn_norm-12]' op = MUL_MAT
nmse = 0.001812017109 tensor 'ffn_swiglu-12[ffn_gate-12, ffn_up-12]' op = GLU
nmse = 0.001930579092 tensor 'ffn_out-12[blk.12.ffn_down.weight, ffn_swiglu-12]' op = MUL_MAT
nmse = 0.000000304193 tensor 'l_out-12[ffn_out-12, ffn_inp-12]' op = ADD
nmse = 0.000386524507 tensor 'norm-13[l_out-12]' op = RMS_NORM
nmse = 0.000503578322 tensor 'attn_norm-13[norm-13, blk.13.attn_norm.weight]' op = MUL
nmse = 0.000252130785 tensor 'Qcur-13[blk.13.attn_q.weight, attn_norm-13]' op = MUL_MAT
nmse = 0.001153804217 tensor 'Vcur-13[blk.13.attn_v.weight, attn_norm-13]' op = MUL_MAT
nmse = 0.000466114148 tensor 'Kcur-13[blk.13.attn_k.weight, attn_norm-13]' op = MUL_MAT
nmse = 0.000295038536 tensor 'norm-13[Qcur-13 (reshaped)]' op = RMS_NORM
nmse = 0.000315586298 tensor 'Qcur_normed-13[norm-13, blk.13.attn_q_norm.weight]' op = MUL
nmse = 0.000315586230 tensor 'Qcur-13[Qcur_normed-13, leaf_5]' op = ROPE
nmse = 0.000391508421 tensor 'norm-13[Kcur-13 (reshaped)]' op = RMS_NORM
nmse = 0.000347694441 tensor 'Kcur_normed-13[norm-13, blk.13.attn_k_norm.weight]' op = MUL
nmse = 0.000347694437 tensor 'Kcur-13[Kcur_normed-13, leaf_5]' op = ROPE
nmse = 0.000347435882 tensor 'cache_k_l13 (view)[Kcur-13 (view), leaf_9, cache_k_l13]' op = SET_ROWS
nmse = 0.001153860053 tensor 'cache_v_l13 (reshaped) (view)[Vcur-13 (reshaped) (reshaped), leaf_11, cache_v_l13 (reshaped)]' op = SET_ROWS
nmse = 0.000042116752 tensor 'kq-13[cache_k_l13 (view) (permuted), Qcur-13 (view) (permuted)]' op = MUL_MAT
nmse = 0.000150153018 tensor 'kq_soft_max-13[kq-13, leaf_13]' op = SOFT_MAX
nmse = 0.001187372836 tensor 'kqv-13[cache_v_l13 (view) (permuted), kq_soft_max-13]' op = MUL_MAT
nmse = 0.001187372836 tensor 'kqv_out-13[kqv-13 (permuted)]' op = CONT
nmse = 0.001973498685 tensor 'node_552[blk.13.attn_output.weight, kqv_out-13]' op = MUL_MAT
nmse = 0.000000320325 tensor 'ffn_inp-13[node_552, l_out-12]' op = ADD
nmse = 0.000429929515 tensor 'norm-13[ffn_inp-13]' op = RMS_NORM
nmse = 0.000988342936 tensor 'ffn_norm-13[norm-13, blk.13.ffn_norm.weight]' op = MUL
nmse = 0.000639075059 tensor 'ffn_gate-13[blk.13.ffn_gate.weight, ffn_norm-13]' op = MUL_MAT
nmse = 0.001170186890 tensor 'ffn_up-13[blk.13.ffn_up.weight, ffn_norm-13]' op = MUL_MAT
nmse = 0.002019142980 tensor 'ffn_swiglu-13[ffn_gate-13, ffn_up-13]' op = GLU
nmse = 0.002267712615 tensor 'ffn_out-13[blk.13.ffn_down.weight, ffn_swiglu-13]' op = MUL_MAT
nmse = 0.000000323544 tensor 'l_out-13[ffn_out-13, ffn_inp-13]' op = ADD
nmse = 0.000415827663 tensor 'norm-14[l_out-13]' op = RMS_NORM
nmse = 0.000594463817 tensor 'attn_norm-14[norm-14, blk.14.attn_norm.weight]' op = MUL
nmse = 0.000379862428 tensor 'Qcur-14[blk.14.attn_q.weight, attn_norm-14]' op = MUL_MAT
nmse = 0.001446750737 tensor 'Vcur-14[blk.14.attn_v.weight, attn_norm-14]' op = MUL_MAT
nmse = 0.000500420981 tensor 'Kcur-14[blk.14.attn_k.weight, attn_norm-14]' op = MUL_MAT
nmse = 0.000296478733 tensor 'norm-14[Qcur-14 (reshaped)]' op = RMS_NORM
nmse = 0.000321239124 tensor 'Qcur_normed-14[norm-14, blk.14.attn_q_norm.weight]' op = MUL
nmse = 0.000321239114 tensor 'Qcur-14[Qcur_normed-14, leaf_5]' op = ROPE
nmse = 0.000425370703 tensor 'norm-14[Kcur-14 (reshaped)]' op = RMS_NORM
nmse = 0.000316754184 tensor 'Kcur_normed-14[norm-14, blk.14.attn_k_norm.weight]' op = MUL
nmse = 0.000316754065 tensor 'Kcur-14[Kcur_normed-14, leaf_5]' op = ROPE
nmse = 0.000316467299 tensor 'cache_k_l14 (view)[Kcur-14 (view), leaf_9, cache_k_l14]' op = SET_ROWS
nmse = 0.001447276507 tensor 'cache_v_l14 (reshaped) (view)[Vcur-14 (reshaped) (reshaped), leaf_11, cache_v_l14 (reshaped)]' op = SET_ROWS
nmse = 0.000081970512 tensor 'kq-14[cache_k_l14 (view) (permuted), Qcur-14 (view) (permuted)]' op = MUL_MAT
nmse = 0.000073499904 tensor 'kq_soft_max-14[kq-14, leaf_13]' op = SOFT_MAX
nmse = 0.001370590090 tensor 'kqv-14[cache_v_l14 (view) (permuted), kq_soft_max-14]' op = MUL_MAT
nmse = 0.001370590090 tensor 'kqv_out-14[kqv-14 (permuted)]' op = CONT
nmse = 0.001301140585 tensor 'node_592[blk.14.attn_output.weight, kqv_out-14]' op = MUL_MAT
nmse = 0.000000321948 tensor 'ffn_inp-14[node_592, l_out-13]' op = ADD
nmse = 0.000446287172 tensor 'norm-14[ffn_inp-14]' op = RMS_NORM
nmse = 0.000988827906 tensor 'ffn_norm-14[norm-14, blk.14.ffn_norm.weight]' op = MUL
nmse = 0.000624560322 tensor 'ffn_gate-14[blk.14.ffn_gate.weight, ffn_norm-14]' op = MUL_MAT
nmse = 0.001209351488 tensor 'ffn_up-14[blk.14.ffn_up.weight, ffn_norm-14]' op = MUL_MAT
nmse = 0.001806053088 tensor 'ffn_swiglu-14[ffn_gate-14, ffn_up-14]' op = GLU
nmse = 0.001707979725 tensor 'ffn_out-14[blk.14.ffn_down.weight, ffn_swiglu-14]' op = MUL_MAT
nmse = 0.000000333474 tensor 'l_out-14[ffn_out-14, ffn_inp-14]' op = ADD
nmse = 0.000431480269 tensor 'norm-15[l_out-14]' op = RMS_NORM
nmse = 0.000547556914 tensor 'attn_norm-15[norm-15, blk.15.attn_norm.weight]' op = MUL
nmse = 0.000323672291 tensor 'Qcur-15[blk.15.attn_q.weight, attn_norm-15]' op = MUL_MAT
nmse = 0.001495011133 tensor 'Vcur-15[blk.15.attn_v.weight, attn_norm-15]' op = MUL_MAT
nmse = 0.000482107558 tensor 'Kcur-15[blk.15.attn_k.weight, attn_norm-15]' op = MUL_MAT
nmse = 0.000260227883 tensor 'norm-15[Qcur-15 (reshaped)]' op = RMS_NORM
nmse = 0.000313017439 tensor 'Qcur_normed-15[norm-15, blk.15.attn_q_norm.weight]' op = MUL
nmse = 0.000313017427 tensor 'Qcur-15[Qcur_normed-15, leaf_5]' op = ROPE
nmse = 0.000385594435 tensor 'norm-15[Kcur-15 (reshaped)]' op = RMS_NORM
nmse = 0.000113879302 tensor 'Kcur_normed-15[norm-15, blk.15.attn_k_norm.weight]' op = MUL
nmse = 0.000113879312 tensor 'Kcur-15[Kcur_normed-15, leaf_5]' op = ROPE
nmse = 0.000114182953 tensor 'cache_k_l15 (view)[Kcur-15 (view), leaf_9, cache_k_l15]' op = SET_ROWS
nmse = 0.001495587356 tensor 'cache_v_l15 (reshaped) (view)[Vcur-15 (reshaped) (reshaped), leaf_11, cache_v_l15 (reshaped)]' op = SET_ROWS
nmse = 0.000053414664 tensor 'kq-15[cache_k_l15 (view) (permuted), Qcur-15 (view) (permuted)]' op = MUL_MAT
nmse = 0.000032565436 tensor 'kq_soft_max-15[kq-15, leaf_13]' op = SOFT_MAX
nmse = 0.000851018091 tensor 'kqv-15[cache_v_l15 (view) (permuted), kq_soft_max-15]' op = MUL_MAT
nmse = 0.000851018091 tensor 'kqv_out-15[kqv-15 (permuted)]' op = CONT
nmse = 0.001138545065 tensor 'node_632[blk.15.attn_output.weight, kqv_out-15]' op = MUL_MAT
nmse = 0.000000343878 tensor 'ffn_inp-15[node_632, l_out-14]' op = ADD
nmse = 0.000441609039 tensor 'norm-15[ffn_inp-15]' op = RMS_NORM
nmse = 0.000777554270 tensor 'ffn_norm-15[norm-15, blk.15.ffn_norm.weight]' op = MUL
nmse = 0.000496860860 tensor 'ffn_gate-15[blk.15.ffn_gate.weight, ffn_norm-15]' op = MUL_MAT
nmse = 0.000903772542 tensor 'ffn_up-15[blk.15.ffn_up.weight, ffn_norm-15]' op = MUL_MAT
nmse = 0.001562816711 tensor 'ffn_swiglu-15[ffn_gate-15, ffn_up-15]' op = GLU
nmse = 0.001832168031 tensor 'ffn_out-15[blk.15.ffn_down.weight, ffn_swiglu-15]' op = MUL_MAT
nmse = 0.000000365395 tensor 'l_out-15[ffn_out-15, ffn_inp-15]' op = ADD
nmse = 0.000453349514 tensor 'norm-16[l_out-15]' op = RMS_NORM
nmse = 0.000481959060 tensor 'attn_norm-16[norm-16, blk.16.attn_norm.weight]' op = MUL
nmse = 0.000285724132 tensor 'Qcur-16[blk.16.attn_q.weight, attn_norm-16]' op = MUL_MAT
nmse = 0.001190868484 tensor 'Vcur-16[blk.16.attn_v.weight, attn_norm-16]' op = MUL_MAT
nmse = 0.000466490700 tensor 'Kcur-16[blk.16.attn_k.weight, attn_norm-16]' op = MUL_MAT
nmse = 0.000216053189 tensor 'norm-16[Qcur-16 (reshaped)]' op = RMS_NORM
nmse = 0.000297872940 tensor 'Qcur_normed-16[norm-16, blk.16.attn_q_norm.weight]' op = MUL
nmse = 0.000297872955 tensor 'Qcur-16[Qcur_normed-16, leaf_5]' op = ROPE
nmse = 0.000360479287 tensor 'norm-16[Kcur-16 (reshaped)]' op = RMS_NORM
nmse = 0.000308220567 tensor 'Kcur_normed-16[norm-16, blk.16.attn_k_norm.weight]' op = MUL
nmse = 0.000308220551 tensor 'Kcur-16[Kcur_normed-16, leaf_5]' op = ROPE
nmse = 0.000307969566 tensor 'cache_k_l16 (view)[Kcur-16 (view), leaf_9, cache_k_l16]' op = SET_ROWS
nmse = 0.001190809612 tensor 'cache_v_l16 (reshaped) (view)[Vcur-16 (reshaped) (reshaped), leaf_11, cache_v_l16 (reshaped)]' op = SET_ROWS
nmse = 0.000053694930 tensor 'kq-16[cache_k_l16 (view) (permuted), Qcur-16 (view) (permuted)]' op = MUL_MAT
nmse = 0.000064732454 tensor 'kq_soft_max-16[kq-16, leaf_13]' op = SOFT_MAX
nmse = 0.001114029233 tensor 'kqv-16[cache_v_l16 (view) (permuted), kq_soft_max-16]' op = MUL_MAT
nmse = 0.001114029233 tensor 'kqv_out-16[kqv-16 (permuted)]' op = CONT
nmse = 0.001190730402 tensor 'node_672[blk.16.attn_output.weight, kqv_out-16]' op = MUL_MAT
nmse = 0.000000390393 tensor 'ffn_inp-16[node_672, l_out-15]' op = ADD
nmse = 0.000479653921 tensor 'norm-16[ffn_inp-16]' op = RMS_NORM
nmse = 0.000686639609 tensor 'ffn_norm-16[norm-16, blk.16.ffn_norm.weight]' op = MUL
nmse = 0.000385799178 tensor 'ffn_gate-16[blk.16.ffn_gate.weight, ffn_norm-16]' op = MUL_MAT
nmse = 0.000788659877 tensor 'ffn_up-16[blk.16.ffn_up.weight, ffn_norm-16]' op = MUL_MAT
nmse = 0.001236284368 tensor 'ffn_swiglu-16[ffn_gate-16, ffn_up-16]' op = GLU
nmse = 0.001271186534 tensor 'ffn_out-16[blk.16.ffn_down.weight, ffn_swiglu-16]' op = MUL_MAT
nmse = 0.000000479966 tensor 'l_out-16[ffn_out-16, ffn_inp-16]' op = ADD
nmse = 0.000426569607 tensor 'norm-17[l_out-16]' op = RMS_NORM
nmse = 0.000452849186 tensor 'attn_norm-17[norm-17, blk.17.attn_norm.weight]' op = MUL
nmse = 0.000301684256 tensor 'Qcur-17[blk.17.attn_q.weight, attn_norm-17]' op = MUL_MAT
nmse = 0.001279348846 tensor 'Vcur-17[blk.17.attn_v.weight, attn_norm-17]' op = MUL_MAT
nmse = 0.000409825242 tensor 'Kcur-17[blk.17.attn_k.weight, attn_norm-17]' op = MUL_MAT
nmse = 0.000250687029 tensor 'norm-17[Qcur-17 (reshaped)]' op = RMS_NORM
nmse = 0.000338437777 tensor 'Qcur_normed-17[norm-17, blk.17.attn_q_norm.weight]' op = MUL
nmse = 0.000338437825 tensor 'Qcur-17[Qcur_normed-17, leaf_5]' op = ROPE
nmse = 0.000339215076 tensor 'norm-17[Kcur-17 (reshaped)]' op = RMS_NORM
nmse = 0.000200279298 tensor 'Kcur_normed-17[norm-17, blk.17.attn_k_norm.weight]' op = MUL
nmse = 0.000200279317 tensor 'Kcur-17[Kcur_normed-17, leaf_5]' op = ROPE
nmse = 0.000200170056 tensor 'cache_k_l17 (view)[Kcur-17 (view), leaf_9, cache_k_l17]' op = SET_ROWS
nmse = 0.001279429924 tensor 'cache_v_l17 (reshaped) (view)[Vcur-17 (reshaped) (reshaped), leaf_11, cache_v_l17 (reshaped)]' op = SET_ROWS
nmse = 0.000062596424 tensor 'kq-17[cache_k_l17 (view) (permuted), Qcur-17 (view) (permuted)]' op = MUL_MAT
nmse = 0.000065037363 tensor 'kq_soft_max-17[kq-17, leaf_13]' op = SOFT_MAX
nmse = 0.000989304031 tensor 'kqv-17[cache_v_l17 (view) (permuted), kq_soft_max-17]' op = MUL_MAT
nmse = 0.000989304031 tensor 'kqv_out-17[kqv-17 (permuted)]' op = CONT
nmse = 0.001564420413 tensor 'node_712[blk.17.attn_output.weight, kqv_out-17]' op = MUL_MAT
nmse = 0.000000540212 tensor 'ffn_inp-17[node_712, l_out-16]' op = ADD
nmse = 0.000456130549 tensor 'norm-17[ffn_inp-17]' op = RMS_NORM
nmse = 0.000636459781 tensor 'ffn_norm-17[norm-17, blk.17.ffn_norm.weight]' op = MUL
nmse = 0.000388024715 tensor 'ffn_gate-17[blk.17.ffn_gate.weight, ffn_norm-17]' op = MUL_MAT
nmse = 0.000737264053 tensor 'ffn_up-17[blk.17.ffn_up.weight, ffn_norm-17]' op = MUL_MAT
nmse = 0.001552071417 tensor 'ffn_swiglu-17[ffn_gate-17, ffn_up-17]' op = GLU
nmse = 0.001751136323 tensor 'ffn_out-17[blk.17.ffn_down.weight, ffn_swiglu-17]' op = MUL_MAT
nmse = 0.000000598210 tensor 'l_out-17[ffn_out-17, ffn_inp-17]' op = ADD
nmse = 0.000461487752 tensor 'norm-18[l_out-17]' op = RMS_NORM
nmse = 0.000440213516 tensor 'attn_norm-18[norm-18, blk.18.attn_norm.weight]' op = MUL
nmse = 0.000278014403 tensor 'Qcur-18[blk.18.attn_q.weight, attn_norm-18]' op = MUL_MAT
nmse = 0.001061595219 tensor 'Vcur-18[blk.18.attn_v.weight, attn_norm-18]' op = MUL_MAT
nmse = 0.000368174584 tensor 'Kcur-18[blk.18.attn_k.weight, attn_norm-18]' op = MUL_MAT
nmse = 0.000222618207 tensor 'norm-18[Qcur-18 (reshaped)]' op = RMS_NORM
nmse = 0.000271271193 tensor 'Qcur_normed-18[norm-18, blk.18.attn_q_norm.weight]' op = MUL
nmse = 0.000271271195 tensor 'Qcur-18[Qcur_normed-18, leaf_5]' op = ROPE
nmse = 0.000302044875 tensor 'norm-18[Kcur-18 (reshaped)]' op = RMS_NORM
nmse = 0.000248010587 tensor 'Kcur_normed-18[norm-18, blk.18.attn_k_norm.weight]' op = MUL
nmse = 0.000248010593 tensor 'Kcur-18[Kcur_normed-18, leaf_5]' op = ROPE
nmse = 0.000247931195 tensor 'cache_k_l18 (view)[Kcur-18 (view), leaf_9, cache_k_l18]' op = SET_ROWS
nmse = 0.001061865624 tensor 'cache_v_l18 (reshaped) (view)[Vcur-18 (reshaped) (reshaped), leaf_11, cache_v_l18 (reshaped)]' op = SET_ROWS
nmse = 0.000054068165 tensor 'kq-18[cache_k_l18 (view) (permuted), Qcur-18 (view) (permuted)]' op = MUL_MAT
nmse = 0.000109239503 tensor 'kq_soft_max-18[kq-18, leaf_13]' op = SOFT_MAX
nmse = 0.001407583995 tensor 'kqv-18[cache_v_l18 (view) (permuted), kq_soft_max-18]' op = MUL_MAT
nmse = 0.001407583995 tensor 'kqv_out-18[kqv-18 (permuted)]' op = CONT
nmse = 0.001493759764 tensor 'node_752[blk.18.attn_output.weight, kqv_out-18]' op = MUL_MAT
nmse = 0.000000670651 tensor 'ffn_inp-18[node_752, l_out-17]' op = ADD
nmse = 0.000463325677 tensor 'norm-18[ffn_inp-18]' op = RMS_NORM
nmse = 0.000625961718 tensor 'ffn_norm-18[norm-18, blk.18.ffn_norm.weight]' op = MUL
nmse = 0.000345023310 tensor 'ffn_gate-18[blk.18.ffn_gate.weight, ffn_norm-18]' op = MUL_MAT
nmse = 0.000561422371 tensor 'ffn_up-18[blk.18.ffn_up.weight, ffn_norm-18]' op = MUL_MAT
nmse = 0.001384766257 tensor 'ffn_swiglu-18[ffn_gate-18, ffn_up-18]' op = GLU
nmse = 0.001530681136 tensor 'ffn_out-18[blk.18.ffn_down.weight, ffn_swiglu-18]' op = MUL_MAT
nmse = 0.000000792879 tensor 'l_out-18[ffn_out-18, ffn_inp-18]' op = ADD
nmse = 0.000475788206 tensor 'norm-19[l_out-18]' op = RMS_NORM
nmse = 0.000466119003 tensor 'attn_norm-19[norm-19, blk.19.attn_norm.weight]' op = MUL
nmse = 0.000357434526 tensor 'Qcur-19[blk.19.attn_q.weight, attn_norm-19]' op = MUL_MAT
nmse = 0.000946090938 tensor 'Vcur-19[blk.19.attn_v.weight, attn_norm-19]' op = MUL_MAT
nmse = 0.000404257238 tensor 'Kcur-19[blk.19.attn_k.weight, attn_norm-19]' op = MUL_MAT
nmse = 0.000294086141 tensor 'norm-19[Qcur-19 (reshaped)]' op = RMS_NORM
nmse = 0.000303598932 tensor 'Qcur_normed-19[norm-19, blk.19.attn_q_norm.weight]' op = MUL
nmse = 0.000303598939 tensor 'Qcur-19[Qcur_normed-19, leaf_5]' op = ROPE
nmse = 0.000329234893 tensor 'norm-19[Kcur-19 (reshaped)]' op = RMS_NORM
nmse = 0.000246306621 tensor 'Kcur_normed-19[norm-19, blk.19.attn_k_norm.weight]' op = MUL
nmse = 0.000246306656 tensor 'Kcur-19[Kcur_normed-19, leaf_5]' op = ROPE
nmse = 0.000246731171 tensor 'cache_k_l19 (view)[Kcur-19 (view), leaf_9, cache_k_l19]' op = SET_ROWS
nmse = 0.000946036328 tensor 'cache_v_l19 (reshaped) (view)[Vcur-19 (reshaped) (reshaped), leaf_11, cache_v_l19 (reshaped)]' op = SET_ROWS
nmse = 0.000058162725 tensor 'kq-19[cache_k_l19 (view) (permuted), Qcur-19 (view) (permuted)]' op = MUL_MAT
nmse = 0.000174914688 tensor 'kq_soft_max-19[kq-19, leaf_13]' op = SOFT_MAX
nmse = 0.001899746376 tensor 'kqv-19[cache_v_l19 (view) (permuted), kq_soft_max-19]' op = MUL_MAT
nmse = 0.001899746376 tensor 'kqv_out-19[kqv-19 (permuted)]' op = CONT
nmse = 0.001510917525 tensor 'node_792[blk.19.attn_output.weight, kqv_out-19]' op = MUL_MAT
nmse = 0.000000909196 tensor 'ffn_inp-19[node_792, l_out-18]' op = ADD
nmse = 0.000500152273 tensor 'norm-19[ffn_inp-19]' op = RMS_NORM
nmse = 0.000650107866 tensor 'ffn_norm-19[norm-19, blk.19.ffn_norm.weight]' op = MUL
nmse = 0.000327938354 tensor 'ffn_gate-19[blk.19.ffn_gate.weight, ffn_norm-19]' op = MUL_MAT
nmse = 0.000532134569 tensor 'ffn_up-19[blk.19.ffn_up.weight, ffn_norm-19]' op = MUL_MAT
nmse = 0.000955395800 tensor 'ffn_swiglu-19[ffn_gate-19, ffn_up-19]' op = GLU
nmse = 0.000948960790 tensor 'ffn_out-19[blk.19.ffn_down.weight, ffn_swiglu-19]' op = MUL_MAT
nmse = 0.000001153743 tensor 'l_out-19[ffn_out-19, ffn_inp-19]' op = ADD
nmse = 0.000459461158 tensor 'norm-20[l_out-19]' op = RMS_NORM
nmse = 0.000406727197 tensor 'attn_norm-20[norm-20, blk.20.attn_norm.weight]' op = MUL
nmse = 0.000279863701 tensor 'Qcur-20[blk.20.attn_q.weight, attn_norm-20]' op = MUL_MAT
nmse = 0.000771072780 tensor 'Vcur-20[blk.20.attn_v.weight, attn_norm-20]' op = MUL_MAT
nmse = 0.000465634415 tensor 'Kcur-20[blk.20.attn_k.weight, attn_norm-20]' op = MUL_MAT
nmse = 0.000222752123 tensor 'norm-20[Qcur-20 (reshaped)]' op = RMS_NORM
nmse = 0.000240396720 tensor 'Qcur_normed-20[norm-20, blk.20.attn_q_norm.weight]' op = MUL
nmse = 0.000240396696 tensor 'Qcur-20[Qcur_normed-20, leaf_5]' op = ROPE
nmse = 0.000355079381 tensor 'norm-20[Kcur-20 (reshaped)]' op = RMS_NORM
nmse = 0.000304159625 tensor 'Kcur_normed-20[norm-20, blk.20.attn_k_norm.weight]' op = MUL
nmse = 0.000304159640 tensor 'Kcur-20[Kcur_normed-20, leaf_5]' op = ROPE
nmse = 0.000303776346 tensor 'cache_k_l20 (view)[Kcur-20 (view), leaf_9, cache_k_l20]' op = SET_ROWS
nmse = 0.000770952483 tensor 'cache_v_l20 (reshaped) (view)[Vcur-20 (reshaped) (reshaped), leaf_11, cache_v_l20 (reshaped)]' op = SET_ROWS
nmse = 0.000067624789 tensor 'kq-20[cache_k_l20 (view) (permuted), Qcur-20 (view) (permuted)]' op = MUL_MAT
nmse = 0.000054696585 tensor 'kq_soft_max-20[kq-20, leaf_13]' op = SOFT_MAX
nmse = 0.001118368728 tensor 'kqv-20[cache_v_l20 (view) (permuted), kq_soft_max-20]' op = MUL_MAT
nmse = 0.001118368728 tensor 'kqv_out-20[kqv-20 (permuted)]' op = CONT
nmse = 0.001068394563 tensor 'node_832[blk.20.attn_output.weight, kqv_out-20]' op = MUL_MAT
nmse = 0.000001272778 tensor 'ffn_inp-20[node_832, l_out-19]' op = ADD
nmse = 0.000462952844 tensor 'norm-20[ffn_inp-20]' op = RMS_NORM
nmse = 0.000578349208 tensor 'ffn_norm-20[norm-20, blk.20.ffn_norm.weight]' op = MUL
nmse = 0.000369285375 tensor 'ffn_gate-20[blk.20.ffn_gate.weight, ffn_norm-20]' op = MUL_MAT
nmse = 0.000532771005 tensor 'ffn_up-20[blk.20.ffn_up.weight, ffn_norm-20]' op = MUL_MAT
nmse = 0.001360344680 tensor 'ffn_swiglu-20[ffn_gate-20, ffn_up-20]' op = GLU
nmse = 0.001466495933 tensor 'ffn_out-20[blk.20.ffn_down.weight, ffn_swiglu-20]' op = MUL_MAT
nmse = 0.000001748957 tensor 'l_out-20[ffn_out-20, ffn_inp-20]' op = ADD
nmse = 0.000496686979 tensor 'norm-21[l_out-20]' op = RMS_NORM
nmse = 0.000439504580 tensor 'attn_norm-21[norm-21, blk.21.attn_norm.weight]' op = MUL
nmse = 0.000339397701 tensor 'Qcur-21[blk.21.attn_q.weight, attn_norm-21]' op = MUL_MAT
nmse = 0.000581690260 tensor 'Vcur-21[blk.21.attn_v.weight, attn_norm-21]' op = MUL_MAT
nmse = 0.000438347976 tensor 'Kcur-21[blk.21.attn_k.weight, attn_norm-21]' op = MUL_MAT
nmse = 0.000302245749 tensor 'norm-21[Qcur-21 (reshaped)]' op = RMS_NORM
nmse = 0.000348207328 tensor 'Qcur_normed-21[norm-21, blk.21.attn_q_norm.weight]' op = MUL
nmse = 0.000348207414 tensor 'Qcur-21[Qcur_normed-21, leaf_5]' op = ROPE
nmse = 0.000367550663 tensor 'norm-21[Kcur-21 (reshaped)]' op = RMS_NORM
nmse = 0.000223839756 tensor 'Kcur_normed-21[norm-21, blk.21.attn_k_norm.weight]' op = MUL
nmse = 0.000223839733 tensor 'Kcur-21[Kcur_normed-21, leaf_5]' op = ROPE
nmse = 0.000223589721 tensor 'cache_k_l21 (view)[Kcur-21 (view), leaf_9, cache_k_l21]' op = SET_ROWS
nmse = 0.000581324943 tensor 'cache_v_l21 (reshaped) (view)[Vcur-21 (reshaped) (reshaped), leaf_11, cache_v_l21 (reshaped)]' op = SET_ROWS
nmse = 0.000039563583 tensor 'kq-21[cache_k_l21 (view) (permuted), Qcur-21 (view) (permuted)]' op = MUL_MAT
nmse = 0.000018477749 tensor 'kq_soft_max-21[kq-21, leaf_13]' op = SOFT_MAX
nmse = 0.000692106675 tensor 'kqv-21[cache_v_l21 (view) (permuted), kq_soft_max-21]' op = MUL_MAT
nmse = 0.000692106675 tensor 'kqv_out-21[kqv-21 (permuted)]' op = CONT
nmse = 0.000666896823 tensor 'node_872[blk.21.attn_output.weight, kqv_out-21]' op = MUL_MAT
nmse = 0.000001867437 tensor 'ffn_inp-21[node_872, l_out-20]' op = ADD
nmse = 0.000463129803 tensor 'norm-21[ffn_inp-21]' op = RMS_NORM
nmse = 0.000625134416 tensor 'ffn_norm-21[norm-21, blk.21.ffn_norm.weight]' op = MUL
nmse = 0.000402410702 tensor 'ffn_gate-21[blk.21.ffn_gate.weight, ffn_norm-21]' op = MUL_MAT
nmse = 0.000672651292 tensor 'ffn_up-21[blk.21.ffn_up.weight, ffn_norm-21]' op = MUL_MAT
nmse = 0.001140556475 tensor 'ffn_swiglu-21[ffn_gate-21, ffn_up-21]' op = GLU
nmse = 0.001142173190 tensor 'ffn_out-21[blk.21.ffn_down.weight, ffn_swiglu-21]' op = MUL_MAT
nmse = 0.000002427949 tensor 'l_out-21[ffn_out-21, ffn_inp-21]' op = ADD
nmse = 0.000478275749 tensor 'norm-22[l_out-21]' op = RMS_NORM
nmse = 0.000432202438 tensor 'attn_norm-22[norm-22, blk.22.attn_norm.weight]' op = MUL
nmse = 0.000397564302 tensor 'Qcur-22[blk.22.attn_q.weight, attn_norm-22]' op = MUL_MAT
nmse = 0.000623501189 tensor 'Vcur-22[blk.22.attn_v.weight, attn_norm-22]' op = MUL_MAT
nmse = 0.000401379921 tensor 'Kcur-22[blk.22.attn_k.weight, attn_norm-22]' op = MUL_MAT
nmse = 0.000325794107 tensor 'norm-22[Qcur-22 (reshaped)]' op = RMS_NORM
nmse = 0.000246351045 tensor 'Qcur_normed-22[norm-22, blk.22.attn_q_norm.weight]' op = MUL
nmse = 0.000246351084 tensor 'Qcur-22[Qcur_normed-22, leaf_5]' op = ROPE
nmse = 0.000317758859 tensor 'norm-22[Kcur-22 (reshaped)]' op = RMS_NORM
nmse = 0.000202772588 tensor 'Kcur_normed-22[norm-22, blk.22.attn_k_norm.weight]' op = MUL
nmse = 0.000202772541 tensor 'Kcur-22[Kcur_normed-22, leaf_5]' op = ROPE
nmse = 0.000202876598 tensor 'cache_k_l22 (view)[Kcur-22 (view), leaf_9, cache_k_l22]' op = SET_ROWS
nmse = 0.000623463398 tensor 'cache_v_l22 (reshaped) (view)[Vcur-22 (reshaped) (reshaped), leaf_11, cache_v_l22 (reshaped)]' op = SET_ROWS
nmse = 0.000040392167 tensor 'kq-22[cache_k_l22 (view) (permuted), Qcur-22 (view) (permuted)]' op = MUL_MAT
nmse = 0.000057179345 tensor 'kq_soft_max-22[kq-22, leaf_13]' op = SOFT_MAX
nmse = 0.000981911735 tensor 'kqv-22[cache_v_l22 (view) (permuted), kq_soft_max-22]' op = MUL_MAT
nmse = 0.000981911735 tensor 'kqv_out-22[kqv-22 (permuted)]' op = CONT
nmse = 0.001030680090 tensor 'node_912[blk.22.attn_output.weight, kqv_out-22]' op = MUL_MAT
nmse = 0.000002826090 tensor 'ffn_inp-22[node_912, l_out-21]' op = ADD
nmse = 0.000470390872 tensor 'norm-22[ffn_inp-22]' op = RMS_NORM
nmse = 0.000699296892 tensor 'ffn_norm-22[norm-22, blk.22.ffn_norm.weight]' op = MUL
nmse = 0.000439189575 tensor 'ffn_gate-22[blk.22.ffn_gate.weight, ffn_norm-22]' op = MUL_MAT
nmse = 0.000797374422 tensor 'ffn_up-22[blk.22.ffn_up.weight, ffn_norm-22]' op = MUL_MAT
nmse = 0.001291948537 tensor 'ffn_swiglu-22[ffn_gate-22, ffn_up-22]' op = GLU
nmse = 0.001284818601 tensor 'ffn_out-22[blk.22.ffn_down.weight, ffn_swiglu-22]' op = MUL_MAT
nmse = 0.000003666873 tensor 'l_out-22[ffn_out-22, ffn_inp-22]' op = ADD
nmse = 0.000467740642 tensor 'norm-23[l_out-22]' op = RMS_NORM
nmse = 0.000450062413 tensor 'attn_norm-23[norm-23, blk.23.attn_norm.weight]' op = MUL
nmse = 0.000336185751 tensor 'Qcur-23[blk.23.attn_q.weight, attn_norm-23]' op = MUL_MAT
nmse = 0.000670923167 tensor 'Vcur-23[blk.23.attn_v.weight, attn_norm-23]' op = MUL_MAT
nmse = 0.000386682381 tensor 'Kcur-23[blk.23.attn_k.weight, attn_norm-23]' op = MUL_MAT
nmse = 0.000259702269 tensor 'norm-23[Qcur-23 (reshaped)]' op = RMS_NORM
nmse = 0.000287596822 tensor 'Qcur_normed-23[norm-23, blk.23.attn_q_norm.weight]' op = MUL
nmse = 0.000287596822 tensor 'Qcur-23[Qcur_normed-23, leaf_5]' op = ROPE
nmse = 0.000321316605 tensor 'norm-23[Kcur-23 (reshaped)]' op = RMS_NORM
nmse = 0.000270389526 tensor 'Kcur_normed-23[norm-23, blk.23.attn_k_norm.weight]' op = MUL
nmse = 0.000270389562 tensor 'Kcur-23[Kcur_normed-23, leaf_5]' op = ROPE
nmse = 0.000270676205 tensor 'cache_k_l23 (view)[Kcur-23 (view), leaf_9, cache_k_l23]' op = SET_ROWS
nmse = 0.000671484943 tensor 'cache_v_l23 (reshaped) (view)[Vcur-23 (reshaped) (reshaped), leaf_11, cache_v_l23 (reshaped)]' op = SET_ROWS
nmse = 0.000038791561 tensor 'kq-23[cache_k_l23 (view) (permuted), Qcur-23 (view) (permuted)]' op = MUL_MAT
nmse = 0.000061932620 tensor 'kq_soft_max-23[kq-23, leaf_13]' op = SOFT_MAX
nmse = 0.001713889068 tensor 'kqv-23[cache_v_l23 (view) (permuted), kq_soft_max-23]' op = MUL_MAT
nmse = 0.001713889068 tensor 'kqv_out-23[kqv-23 (permuted)]' op = CONT
nmse = 0.001839481165 tensor 'node_952[blk.23.attn_output.weight, kqv_out-23]' op = MUL_MAT
nmse = 0.000004103790 tensor 'ffn_inp-23[node_952, l_out-22]' op = ADD
nmse = 0.000483415863 tensor 'norm-23[ffn_inp-23]' op = RMS_NORM
nmse = 0.000792511382 tensor 'ffn_norm-23[norm-23, blk.23.ffn_norm.weight]' op = MUL
nmse = 0.000475077995 tensor 'ffn_gate-23[blk.23.ffn_gate.weight, ffn_norm-23]' op = MUL_MAT
nmse = 0.000946514587 tensor 'ffn_up-23[blk.23.ffn_up.weight, ffn_norm-23]' op = MUL_MAT
nmse = 0.001973072711 tensor 'ffn_swiglu-23[ffn_gate-23, ffn_up-23]' op = GLU
nmse = 0.002156343351 tensor 'ffn_out-23[blk.23.ffn_down.weight, ffn_swiglu-23]' op = MUL_MAT
nmse = 0.000005165690 tensor 'l_out-23[ffn_out-23, ffn_inp-23]' op = ADD
nmse = 0.000528199877 tensor 'norm-24[l_out-23]' op = RMS_NORM
nmse = 0.000513939375 tensor 'attn_norm-24[norm-24, blk.24.attn_norm.weight]' op = MUL
nmse = 0.000351965275 tensor 'Qcur-24[blk.24.attn_q.weight, attn_norm-24]' op = MUL_MAT
nmse = 0.000656782703 tensor 'Vcur-24[blk.24.attn_v.weight, attn_norm-24]' op = MUL_MAT
nmse = 0.000383081739 tensor 'Kcur-24[blk.24.attn_k.weight, attn_norm-24]' op = MUL_MAT
nmse = 0.000385601972 tensor 'norm-24[Qcur-24 (reshaped)]' op = RMS_NORM
nmse = 0.000463461931 tensor 'Qcur_normed-24[norm-24, blk.24.attn_q_norm.weight]' op = MUL
nmse = 0.000463462001 tensor 'Qcur-24[Qcur_normed-24, leaf_5]' op = ROPE
nmse = 0.000302019703 tensor 'norm-24[Kcur-24 (reshaped)]' op = RMS_NORM
nmse = 0.000159458045 tensor 'Kcur_normed-24[norm-24, blk.24.attn_k_norm.weight]' op = MUL
nmse = 0.000159458074 tensor 'Kcur-24[Kcur_normed-24, leaf_5]' op = ROPE
nmse = 0.000159215106 tensor 'cache_k_l24 (view)[Kcur-24 (view), leaf_9, cache_k_l24]' op = SET_ROWS
nmse = 0.000656938033 tensor 'cache_v_l24 (reshaped) (view)[Vcur-24 (reshaped) (reshaped), leaf_11, cache_v_l24 (reshaped)]' op = SET_ROWS
nmse = 0.000046385623 tensor 'kq-24[cache_k_l24 (view) (permuted), Qcur-24 (view) (permuted)]' op = MUL_MAT
nmse = 0.000052847212 tensor 'kq_soft_max-24[kq-24, leaf_13]' op = SOFT_MAX
nmse = 0.001725606630 tensor 'kqv-24[cache_v_l24 (view) (permuted), kq_soft_max-24]' op = MUL_MAT
nmse = 0.001725606630 tensor 'kqv_out-24[kqv-24 (permuted)]' op = CONT
nmse = 0.001352525444 tensor 'node_992[blk.24.attn_output.weight, kqv_out-24]' op = MUL_MAT
nmse = 0.000005700273 tensor 'ffn_inp-24[node_992, l_out-23]' op = ADD
nmse = 0.000500859955 tensor 'norm-24[ffn_inp-24]' op = RMS_NORM
nmse = 0.000955662773 tensor 'ffn_norm-24[norm-24, blk.24.ffn_norm.weight]' op = MUL
nmse = 0.000648914960 tensor 'ffn_gate-24[blk.24.ffn_gate.weight, ffn_norm-24]' op = MUL_MAT
nmse = 0.001157290635 tensor 'ffn_up-24[blk.24.ffn_up.weight, ffn_norm-24]' op = MUL_MAT
nmse = 0.002242476549 tensor 'ffn_swiglu-24[ffn_gate-24, ffn_up-24]' op = GLU
nmse = 0.002300691523 tensor 'ffn_out-24[blk.24.ffn_down.weight, ffn_swiglu-24]' op = MUL_MAT
nmse = 0.000007524360 tensor 'l_out-24[ffn_out-24, ffn_inp-24]' op = ADD
nmse = 0.000560401040 tensor 'norm-25[l_out-24]' op = RMS_NORM
nmse = 0.000594065372 tensor 'attn_norm-25[norm-25, blk.25.attn_norm.weight]' op = MUL
nmse = 0.000367669627 tensor 'Qcur-25[blk.25.attn_q.weight, attn_norm-25]' op = MUL_MAT
nmse = 0.000705539152 tensor 'Vcur-25[blk.25.attn_v.weight, attn_norm-25]' op = MUL_MAT
nmse = 0.000348604101 tensor 'Kcur-25[blk.25.attn_k.weight, attn_norm-25]' op = MUL_MAT
nmse = 0.000272725478 tensor 'norm-25[Qcur-25 (reshaped)]' op = RMS_NORM
nmse = 0.000343857888 tensor 'Qcur_normed-25[norm-25, blk.25.attn_q_norm.weight]' op = MUL
nmse = 0.000343857931 tensor 'Qcur-25[Qcur_normed-25, leaf_5]' op = ROPE
nmse = 0.000261003220 tensor 'norm-25[Kcur-25 (reshaped)]' op = RMS_NORM
nmse = 0.000134749924 tensor 'Kcur_normed-25[norm-25, blk.25.attn_k_norm.weight]' op = MUL
nmse = 0.000134749969 tensor 'Kcur-25[Kcur_normed-25, leaf_5]' op = ROPE
nmse = 0.000134528362 tensor 'cache_k_l25 (view)[Kcur-25 (view), leaf_9, cache_k_l25]' op = SET_ROWS
nmse = 0.000705282999 tensor 'cache_v_l25 (reshaped) (view)[Vcur-25 (reshaped) (reshaped), leaf_11, cache_v_l25 (reshaped)]' op = SET_ROWS
nmse = 0.000034477949 tensor 'kq-25[cache_k_l25 (view) (permuted), Qcur-25 (view) (permuted)]' op = MUL_MAT
nmse = 0.000042878162 tensor 'kq_soft_max-25[kq-25, leaf_13]' op = SOFT_MAX
nmse = 0.000987369059 tensor 'kqv-25[cache_v_l25 (view) (permuted), kq_soft_max-25]' op = MUL_MAT
nmse = 0.000987369059 tensor 'kqv_out-25[kqv-25 (permuted)]' op = CONT
nmse = 0.001286477740 tensor 'node_1032[blk.25.attn_output.weight, kqv_out-25]' op = MUL_MAT
nmse = 0.000008939857 tensor 'ffn_inp-25[node_1032, l_out-24]' op = ADD
nmse = 0.000576597801 tensor 'norm-25[ffn_inp-25]' op = RMS_NORM
nmse = 0.001064857403 tensor 'ffn_norm-25[norm-25, blk.25.ffn_norm.weight]' op = MUL
nmse = 0.000803647505 tensor 'ffn_gate-25[blk.25.ffn_gate.weight, ffn_norm-25]' op = MUL_MAT
nmse = 0.001227472149 tensor 'ffn_up-25[blk.25.ffn_up.weight, ffn_norm-25]' op = MUL_MAT
nmse = 0.001961554041 tensor 'ffn_swiglu-25[ffn_gate-25, ffn_up-25]' op = GLU
nmse = 0.002135897409 tensor 'ffn_out-25[blk.25.ffn_down.weight, ffn_swiglu-25]' op = MUL_MAT
nmse = 0.000011452851 tensor 'l_out-25[ffn_out-25, ffn_inp-25]' op = ADD
nmse = 0.000551810430 tensor 'norm-26[l_out-25]' op = RMS_NORM
nmse = 0.000692091436 tensor 'attn_norm-26[norm-26, blk.26.attn_norm.weight]' op = MUL
nmse = 0.000580864228 tensor 'Qcur-26[blk.26.attn_q.weight, attn_norm-26]' op = MUL_MAT
nmse = 0.000933651197 tensor 'Vcur-26[blk.26.attn_v.weight, attn_norm-26]' op = MUL_MAT
nmse = 0.000449663654 tensor 'Kcur-26[blk.26.attn_k.weight, attn_norm-26]' op = MUL_MAT
nmse = 0.000510016466 tensor 'norm-26[Qcur-26 (reshaped)]' op = RMS_NORM
nmse = 0.000511463930 tensor 'Qcur_normed-26[norm-26, blk.26.attn_q_norm.weight]' op = MUL
nmse = 0.000511464040 tensor 'Qcur-26[Qcur_normed-26, leaf_5]' op = ROPE
nmse = 0.000333862481 tensor 'norm-26[Kcur-26 (reshaped)]' op = RMS_NORM
nmse = 0.000232198239 tensor 'Kcur_normed-26[norm-26, blk.26.attn_k_norm.weight]' op = MUL
nmse = 0.000232198266 tensor 'Kcur-26[Kcur_normed-26, leaf_5]' op = ROPE
nmse = 0.000232032895 tensor 'cache_k_l26 (view)[Kcur-26 (view), leaf_9, cache_k_l26]' op = SET_ROWS
nmse = 0.000933728681 tensor 'cache_v_l26 (reshaped) (view)[Vcur-26 (reshaped) (reshaped), leaf_11, cache_v_l26 (reshaped)]' op = SET_ROWS
nmse = 0.000055128796 tensor 'kq-26[cache_k_l26 (view) (permuted), Qcur-26 (view) (permuted)]' op = MUL_MAT
nmse = 0.000014078140 tensor 'kq_soft_max-26[kq-26, leaf_13]' op = SOFT_MAX
nmse = 0.001137321131 tensor 'kqv-26[cache_v_l26 (view) (permuted), kq_soft_max-26]' op = MUL_MAT
nmse = 0.001137321131 tensor 'kqv_out-26[kqv-26 (permuted)]' op = CONT
nmse = 0.001334865071 tensor 'node_1072[blk.26.attn_output.weight, kqv_out-26]' op = MUL_MAT
nmse = 0.000011907648 tensor 'ffn_inp-26[node_1072, l_out-25]' op = ADD
nmse = 0.000626487174 tensor 'norm-26[ffn_inp-26]' op = RMS_NORM
nmse = 0.001042428413 tensor 'ffn_norm-26[norm-26, blk.26.ffn_norm.weight]' op = MUL
nmse = 0.000815629521 tensor 'ffn_gate-26[blk.26.ffn_gate.weight, ffn_norm-26]' op = MUL_MAT
nmse = 0.001034428365 tensor 'ffn_up-26[blk.26.ffn_up.weight, ffn_norm-26]' op = MUL_MAT
nmse = 0.000309779968 tensor 'ffn_swiglu-26[ffn_gate-26, ffn_up-26]' op = GLU
nmse = 0.001074392763 tensor 'ffn_out-26[blk.26.ffn_down.weight, ffn_swiglu-26]' op = MUL_MAT
nmse = 0.000015657634 tensor 'l_out-26[ffn_out-26, ffn_inp-26]' op = ADD
nmse = 0.000658292139 tensor 'norm-27[l_out-26]' op = RMS_NORM
nmse = 0.001244258205 tensor 'attn_norm-27[norm-27, blk.27.attn_norm.weight]' op = MUL
nmse = 0.000664813509 tensor 'Qcur-27[blk.27.attn_q.weight, attn_norm-27]' op = MUL_MAT
nmse = 0.002414087388 tensor 'Vcur-27[blk.27.attn_v.weight, attn_norm-27]' op = MUL_MAT
nmse = 0.000661790644 tensor 'Kcur-27[blk.27.attn_k.weight, attn_norm-27]' op = MUL_MAT
nmse = 0.000581336529 tensor 'norm-27[Qcur-27 (reshaped)]' op = RMS_NORM
nmse = 0.000296842273 tensor 'Qcur_normed-27[norm-27, blk.27.attn_q_norm.weight]' op = MUL
nmse = 0.000296842237 tensor 'Qcur-27[Qcur_normed-27, leaf_5]' op = ROPE
nmse = 0.000579479141 tensor 'norm-27[Kcur-27 (reshaped)]' op = RMS_NORM
nmse = 0.000397761311 tensor 'Kcur_normed-27[norm-27, blk.27.attn_k_norm.weight]' op = MUL
nmse = 0.000397761343 tensor 'Kcur-27[Kcur_normed-27, leaf_5]' op = ROPE
nmse = 0.000398193507 tensor 'cache_k_l27 (view)[Kcur-27 (view), leaf_9, cache_k_l27]' op = SET_ROWS
nmse = 0.002413719535 tensor 'cache_v_l27 (reshaped) (view)[Vcur-27 (reshaped) (reshaped), leaf_11, cache_v_l27 (reshaped)]' op = SET_ROWS
nmse = 0.000999981237 tensor 'node_1114[l_out-26, leaf_367]' op = GET_ROWS
nmse = 0.000117930554 tensor 'kq-27[cache_k_l27 (view) (permuted), Qcur-27 (view) (permuted)]' op = MUL_MAT
nmse = 0.000219567679 tensor 'kq_soft_max-27[kq-27, leaf_13]' op = SOFT_MAX
nmse = 0.003910016234 tensor 'kqv-27[cache_v_l27 (view) (permuted), kq_soft_max-27]' op = MUL_MAT
nmse = 0.003910016234 tensor 'kqv_out-27[kqv-27 (permuted)]' op = CONT
nmse = 0.004145341173 tensor 'node_1112[blk.27.attn_output.weight, kqv_out-27]' op = MUL_MAT
nmse = 0.004152228392 tensor 'node_1113[node_1112, leaf_367]' op = GET_ROWS
nmse = 0.001169890280 tensor 'ffn_inp-27[node_1113, node_1114]' op = ADD
nmse = 0.001131561247 tensor 'norm-27[ffn_inp-27]' op = RMS_NORM
nmse = 0.001789381224 tensor 'ffn_norm-27[norm-27, blk.27.ffn_norm.weight]' op = MUL
nmse = 0.000427780470 tensor 'ffn_gate-27[blk.27.ffn_gate.weight, ffn_norm-27]' op = MUL_MAT
nmse = 0.000740579453 tensor 'ffn_up-27[blk.27.ffn_up.weight, ffn_norm-27]' op = MUL_MAT
nmse = 0.001095369783 tensor 'ffn_swiglu-27[ffn_gate-27, ffn_up-27]' op = GLU
nmse = 0.001340996304 tensor 'ffn_out-27[blk.27.ffn_down.weight, ffn_swiglu-27]' op = MUL_MAT
nmse = 0.001408428197 tensor 'l_out-27[ffn_out-27, ffn_inp-27]' op = ADD
nmse = 0.001396804485 tensor 'norm[l_out-27]' op = RMS_NORM
nmse = 0.001945843256 tensor 'result_norm[norm, output_norm.weight]' op = MUL
nmse = 0.001751208117 tensor 'result_output[token_embd.weight, result_norm]' op = MUL_MAT

The error accumulates every per-op deviation over time inside the model graph, so it's higher than in test-backend-ops. But it's still nothing out of the ordinary, and it looks very similar on the ROCm backend. Let me know if you have suggestions on how to improve the tool.

I could run the CPU side step by step next to the GPU ops as well, similar to how the Vulkan result checker works; that would give more meaningful error values.
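For reference, this is roughly the metric being reported per tensor above; a minimal sketch, assuming the tool downloads both results as float buffers (the actual tool's implementation may differ):

```cpp
#include <cstddef>

// Normalized mean squared error between the CPU reference and the backend
// output for one tensor: sum((ref - out)^2) / sum(ref^2).
double nmse(const float * ref, const float * out, size_t n) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) ref[i] - (double) out[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    return norm > 0.0 ? err / norm : 0.0;
}
```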

@ggerganov
Copy link
Member

I think the part that is failing here is processing the prompt, saving the state to a file, generating a number of tokens, restoring the state, generating the same amount again and then checking whether the two text results are identical.

Yes, this is correct. We store the state of the llama_context, which includes the state of the memory (i.e. the KV cache), the last generated logits, model info, etc.

Notice that after restoring the state, most of the initial generated tokens match the generated tokens from the first run - this is the expectation. But in some rare cases, the sequence starts to diverge for some reason.
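For context, the failing scenario can be sketched with the public state API (llama_state_get_size / llama_state_get_data / llama_state_set_data); the token-generation helper here is hypothetical and error handling is omitted:

```cpp
#include <cstdint>
#include <string>
#include <vector>

#include "llama.h"

// Hypothetical helper: decodes and samples n tokens on the given context
// and returns the resulting text.
std::string generate_n_tokens(llama_context * ctx, int n);

void check_state_roundtrip(llama_context * ctx, int n) {
    // Save the full context state (KV cache, last logits, etc.) after the
    // prompt has been processed.
    std::vector<uint8_t> state(llama_state_get_size(ctx));
    llama_state_get_data(ctx, state.data(), state.size());

    const std::string run1 = generate_n_tokens(ctx, n);

    // Restore and generate the same number of tokens again.
    llama_state_set_data(ctx, state.data(), state.size());
    const std::string run2 = generate_n_tokens(ctx, n);

    // run1 == run2 is the expectation; the report is that in rare cases
    // the two sequences diverge.
}
```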

@jeffbolznv
Copy link
Collaborator

I was able to reproduce this on my 5090 and am pretty sure I see what's happening. It's caused by the "add_rms_fusion" optimization, which accumulates partial sums into a temporary buffer and uses those to accelerate/parallelize the rms_norm. This optimization does not happen on the first token of the second run, because we only allocate memory for the temporary buffer after having made a pass through the model on a previous token. For the first run, the memory gets allocated during the prompt-processing run on the same context (maybe also during the warmup run? I didn't check that).

The difference in rms_norm calculation leads to a slight precision difference. I think each run is deterministic, but they (arguably) start from a different initial state.

FWIW, there's a chance we could ditch this add_rms_fusion optimization - I think a lot of its benefits have moved into other fusion optimizations, though this bug is evidence that it still happens sometimes. I'd need to do a bit of perf testing.
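To make the rounding argument concrete, here is a conceptual scalar sketch of what the fusion computes (not the actual Vulkan shader): the add writes its output and at the same time accumulates per-group partial sums of squares, which the following rms_norm reduces instead of re-reading the output.

```cpp
#include <algorithm>
#include <cstddef>

// Fused path: y = a + b, plus per-group partial sums of y*y written to a
// temporary buffer; rms_norm later reduces `partials` instead of y itself.
// Assumes group > 0 and partials has ceil(n / group) elements.
void add_with_rms_partials(const float * a, const float * b, float * y,
                           float * partials, size_t n, size_t group) {
    for (size_t g = 0; g * group < n; ++g) {
        float acc = 0.0f;
        const size_t end = std::min(n, (g + 1) * group);
        for (size_t i = g * group; i < end; ++i) {
            y[i] = a[i] + b[i];
            acc += y[i] * y[i];
        }
        partials[g] = acc;
    }
}
```

Reducing the partials and summing the elements directly in one pass are mathematically equal but not bit-identical in floating point, which is enough to make the two runs start from slightly different states.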

@0cc4m
Copy link
Collaborator Author

0cc4m commented Nov 25, 2025

So is it just a coincidence that it triggered on this PR or did I make it worse in some way?

@jeffbolznv
Copy link
Collaborator

I don't think you made it worse; it's probably just that the different precision for MMVQ caused it to manifest.

@0cc4m
Copy link
Collaborator Author

0cc4m commented Nov 25, 2025

Thank you for looking into it. It's great that we are finally making progress on this; it would be nice to get this PR merged at last.

@jeffbolznv
Copy link
Collaborator

FWIW, there's a chance we could ditch this add_rms_fusion optimization - I think a lot of its benefits have moved into other fusion optimizations, though this bug is evidence that it still happens sometimes. I'd need to do a bit of perf testing.

I checked the models I usually run. Most only hit this opt once per token, but qwen3moe and deepseek2 still hit it pretty frequently, and benefit around 1% from it.

Another option might be to just reserve some memory at context creation for this, and either resize immediately when needed or never resize.
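One possible shape of that suggestion; a sketch under the assumption that the buffer is sized once at context creation and only ever grows (names are illustrative, not the PR's code):

```cpp
#include <cstddef>
#include <vector>

// Illustrative pre-allocation for the rms partial-sum buffer: reserve a
// default size at context creation so the fusion path is usable from the
// very first token, then grow (never shrink) if a graph needs more.
struct rms_partials_buffer {
    std::vector<float> data;

    explicit rms_partials_buffer(size_t initial) : data(initial) {}

    void ensure(size_t needed) {
        if (needed > data.size()) {
            data.resize(needed); // resize immediately when needed
        }
    }
};
```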
