
llamafile : improve moe prompt eval speed on cpu #6840

Open · jart wants to merge 2 commits into master from moe

Conversation

jart
Contributor

@jart jart commented Apr 23, 2024

This change introduces a llamafile_mixmul() API that allows tinyBLAS to speed up "Mixture of Expert" models. On my Threadripper, Mixtral's 8x7b F16 weights now process prompts 2x faster. I'm also seeing a 60 percent improvement with Mixtral 8x22b Q4_0. The same applies to Q8_0, which is also supported by tinyBLAS. MoE models spend the majority of their time inside MUL_MAT_ID rather than MUL_MAT, which is why llamafile_sgemm was not able to help them before. llamafile_mixmul works by decomposing the mixmul operation into approximately two sgemm calls.
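For readers who want the gist of that decomposition: a MUL_MAT_ID (mixture-of-experts) multiplication can be reduced to ordinary dense GEMMs by grouping the token rows routed to each expert and multiplying each group against that expert's weight matrix in one call. Below is a minimal sketch of the idea only; the names (naive_sgemm, moe_matmul_by_expert, row_to_expert) are hypothetical, it assumes one expert per token, and it is not the actual llamafile_mixmul() code.

```cpp
// Sketch only: reduce an MoE matmul to one dense GEMM per expert.
// All names here are illustrative; this is not the llamafile_mixmul() API.
#include <cstddef>
#include <vector>

// Naive row-major GEMM: C[m x n] = A[m x k] * B[k x n].
static void naive_sgemm(size_t m, size_t n, size_t k,
                        const float *A, const float *B, float *C) {
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

// activations: n_tokens x k, expert_weights[e]: k x n, out: n_tokens x n.
// row_to_expert[t] says which expert token t was routed to (one expert per token).
void moe_matmul_by_expert(size_t n_tokens, size_t k, size_t n,
                          const std::vector<const float *> &expert_weights,
                          const int *row_to_expert,
                          const float *activations, float *out) {
    for (int e = 0; e < (int)expert_weights.size(); ++e) {
        // Gather the activation rows routed to expert e into one contiguous block.
        std::vector<size_t> rows;
        std::vector<float> gathered;
        for (size_t t = 0; t < n_tokens; ++t)
            if (row_to_expert[t] == e) {
                rows.push_back(t);
                gathered.insert(gathered.end(),
                                activations + t * k, activations + (t + 1) * k);
            }
        if (rows.empty()) continue;
        // One dense GEMM handles every token assigned to this expert...
        std::vector<float> block(rows.size() * n);
        naive_sgemm(rows.size(), n, k, gathered.data(), expert_weights[e], block.data());
        // ...then the results are scattered back into each token's output row.
        for (size_t i = 0; i < rows.size(); ++i)
            for (size_t j = 0; j < n; ++j)
                out[rows[i] * n + j] = block[i * n + j];
    }
}
```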

@jart jart force-pushed the moe branch 2 times, most recently from def794c to 828f3fe on April 23, 2024 05:44
@hiepxanh

nice to see this PR <3 Thank you so much

@USBhost

USBhost commented Apr 23, 2024

Does it also help the other K quants?

@jart
Contributor Author

jart commented Apr 23, 2024

@USBhost Unfortunately no. The K quants were designed to exploit under-utilization of CPU resources when doing matvecs. I tried copying and pasting the Q5_K_M code into a tinyBLAS 2-d block-tiling kernel, but the compiler wasn't able to unroll it in a way that offered performance gains through instruction-level parallelism. I've only been able to make the simpler quants work. It's a shame because I really like Q5_K_M, so it'd be great to see Iwan Kawrakow develop a new quant specifically for block-tiling.
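For context, a "2-d block-tiling kernel" computes a small tile of the output at once so that each loaded value feeds several independent accumulators, giving the compiler room to interleave FMAs. Here is a rough illustrative sketch over plain (already dequantized) floats with an arbitrary 4x4 tile; this is not tinyBLAS code, just the general shape of the technique being discussed.

```cpp
// Illustrative 4x4 register-tiled GEMM over plain floats (not tinyBLAS code).
// A is m x k and B is n x k, both row-major over k (dot-product layout),
// so C[i][j] accumulates dot(A row i, B row j). Assumes m and n are multiples of 4.
#include <cstddef>

void gemm_tile4x4(size_t m, size_t n, size_t k,
                  const float *A, const float *B, float *C) {
    for (size_t i = 0; i + 4 <= m; i += 4) {
        for (size_t j = 0; j + 4 <= n; j += 4) {
            float acc[4][4] = {};          // 16 independent accumulators
            for (size_t p = 0; p < k; ++p) {
                // 8 loads feed 16 multiply-adds; the accumulator chains are
                // independent, which is what creates the instruction-level
                // parallelism a single dot product cannot offer.
                float a0 = A[(i + 0) * k + p], a1 = A[(i + 1) * k + p];
                float a2 = A[(i + 2) * k + p], a3 = A[(i + 3) * k + p];
                float b0 = B[(j + 0) * k + p], b1 = B[(j + 1) * k + p];
                float b2 = B[(j + 2) * k + p], b3 = B[(j + 3) * k + p];
                acc[0][0] += a0 * b0; acc[0][1] += a0 * b1; acc[0][2] += a0 * b2; acc[0][3] += a0 * b3;
                acc[1][0] += a1 * b0; acc[1][1] += a1 * b1; acc[1][2] += a1 * b2; acc[1][3] += a1 * b3;
                acc[2][0] += a2 * b0; acc[2][1] += a2 * b1; acc[2][2] += a2 * b2; acc[2][3] += a2 * b3;
                acc[3][0] += a3 * b0; acc[3][1] += a3 * b1; acc[3][2] += a3 * b2; acc[3][3] += a3 * b3;
            }
            for (size_t ii = 0; ii < 4; ++ii)
                for (size_t jj = 0; jj < 4; ++jj)
                    C[(i + ii) * n + (j + jj)] += acc[ii][jj];
        }
    }
    // Edge tiles (m or n not divisible by 4) are omitted for brevity.
}
```

The difficulty jart describes is that for K quants, producing each a/b value requires unpacking a complicated bit layout, and the compiler cannot keep sixteen of those unpacking-plus-FMA chains in registers without spilling.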

@USBhost

USBhost commented Apr 23, 2024

> @USBhost Unfortunately no. The K quants were designed to exploit under-utilization of CPU resources when doing matvecs. I tried copying and pasting the Q5_K_M code into a tinyBLAS 2-d block-tiling kernel, but the compiler wasn't able to unroll it in a way that offered performance gains through instruction-level parallelism. I've only been able to make the simpler quants work. It's a shame because I really like Q5_K_M, so it'd be great to see Iwan Kawrakow develop a new quant specifically for block-tiling.

I see, thanks for the explanation.

Side note: I would love a doc that explains the speed differences between plain Q4_0 vs Q4_1 vs the K quants, because I keep seeing the simple ones getting buffs.

@jart
Contributor Author

jart commented Apr 23, 2024

The tinyBLAS code upstreamed by Mozilla's llamafile project makes prompt processing go very fast for F32, F16, Q4_0, and Q8_0.

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 1B F16 | 2.05 GiB | 1.10 B | CPU | 96 | pp 512 | 2048.86 ± 6.52 |
| llama 1B F16 | 2.05 GiB | 1.10 B | CPU | 96 | tg 4 | 52.01 ± 0.06 |
| llama 1B all F32 | 4.10 GiB | 1.10 B | CPU | 96 | pp 512 | 1946.19 ± 21.26 |
| llama 1B all F32 | 4.10 GiB | 1.10 B | CPU | 96 | tg 4 | 39.75 ± 0.15 |
| llama 1B Q2_K - Medium | 411.41 MiB | 1.10 B | CPU | 96 | pp 512 | 1273.08 ± 18.57 |
| llama 1B Q2_K - Medium | 411.41 MiB | 1.10 B | CPU | 96 | tg 4 | 69.55 ± 0.22 |
| llama 1B Q3_K - Large | 563.42 MiB | 1.10 B | CPU | 96 | pp 512 | 1109.08 ± 7.66 |
| llama 1B Q3_K - Large | 563.42 MiB | 1.10 B | CPU | 96 | tg 4 | 66.83 ± 0.41 |
| llama 1B Q3_K - Medium | 522.30 MiB | 1.10 B | CPU | 96 | pp 512 | 1161.17 ± 7.58 |
| llama 1B Q3_K - Medium | 522.30 MiB | 1.10 B | CPU | 96 | tg 4 | 67.49 ± 0.10 |
| llama 1B Q3_K - Small | 475.51 MiB | 1.10 B | CPU | 96 | pp 512 | 1052.25 ± 161.64 |
| llama 1B Q3_K - Small | 475.51 MiB | 1.10 B | CPU | 96 | tg 4 | 68.30 ± 0.19 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | CPU | 96 | pp 512 | 1418.41 ± 10.54 |
| llama 1B Q4_0 | 606.53 MiB | 1.10 B | CPU | 96 | tg 4 | 65.78 ± 0.24 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | CPU | 96 | pp 512 | 884.68 ± 3.74 |
| llama 1B Q4_1 | 668.18 MiB | 1.10 B | CPU | 96 | tg 4 | 64.62 ± 0.05 |
| llama 1B Q4_K - Medium | 636.18 MiB | 1.10 B | CPU | 96 | pp 512 | 1197.76 ± 11.42 |
| llama 1B Q4_K - Medium | 636.18 MiB | 1.10 B | CPU | 96 | tg 4 | 66.04 ± 0.34 |
| llama 1B Q4_K - Small | 609.53 MiB | 1.10 B | CPU | 96 | pp 512 | 1200.22 ± 10.06 |
| llama 1B Q4_K - Small | 609.53 MiB | 1.10 B | CPU | 96 | tg 4 | 66.38 ± 0.27 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | CPU | 96 | pp 512 | 1058.68 ± 10.52 |
| llama 1B Q5_0 | 729.84 MiB | 1.10 B | CPU | 96 | tg 4 | 63.10 ± 0.36 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | CPU | 96 | pp 512 | 718.18 ± 127.77 |
| llama 1B Q5_1 | 791.50 MiB | 1.10 B | CPU | 96 | tg 4 | 62.07 ± 0.67 |
| llama 1B Q5_K - Medium | 745.11 MiB | 1.10 B | CPU | 96 | pp 512 | 1055.78 ± 5.80 |
| llama 1B Q5_K - Medium | 745.11 MiB | 1.10 B | CPU | 96 | tg 4 | 64.01 ± 0.16 |
| llama 1B Q5_K - Small | 729.84 MiB | 1.10 B | CPU | 96 | pp 512 | 1048.20 ± 3.90 |
| llama 1B Q5_K - Small | 729.84 MiB | 1.10 B | CPU | 96 | tg 4 | 64.32 ± 0.27 |
| llama 1B Q6_K | 860.86 MiB | 1.10 B | CPU | 96 | pp 512 | 995.96 ± 183.61 |
| llama 1B Q6_K | 860.86 MiB | 1.10 B | CPU | 96 | tg 4 | 62.67 ± 0.22 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 96 | pp 512 | 1430.38 ± 9.86 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | CPU | 96 | tg 4 | 59.71 ± 0.14 |

Measured on AMD Ryzen Threadripper PRO 7995WX with TinyLlama 1.1B. This PR ensures those performance wins will happen for MoE models too.

@jart
Contributor Author

jart commented Apr 24, 2024

Note: I'm still in the process of testing this change and verifying it's correct on all compilers and architectures.

@jart jart force-pushed the moe branch 2 times, most recently from 26ab943 to 89991a1 on April 25, 2024 00:56
@jart
Contributor Author

jart commented Apr 25, 2024

OK, I've worked out the remaining kinks. This code just shipped as part of the llamafile 0.8 release. Thanks to this change, I'm seeing a 2x prompt eval speed increase across the board. My Threadripper now runs Mixtral 2x faster. My M2 Ultra runs Mixtral 2x faster on CPU. This change even pumps the Raspberry Pi 5 up to 78 tok/sec on non-MoE F16 models, in case you want to buy a bag full of them to build your next supercomputer. PTAL.

@ikawrakow
Contributor

> It's a shame because I really like Q5_K_M, so it'd be great to see Iwan Kawrakow develop a new quant specifically for block-tiling.

@jart

I became intrigued by your assumption that block-tiling is required to speed up prompt processing for k-quants, so I spent some time optimizing k-quant CPU matrix multiplications. I'm running on a 16-core Ryzen 7950X CPU, so I have only done a better AVX2 implementation. Baseline for this CPU (using your PR) for a 7B LLaMA is:

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp 512 | 119.75 ± 0.38 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | CPU | 16 | pp 512 | 63.80 ± 0.22 |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | CPU | 16 | pp 512 | 59.50 ± 0.11 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | CPU | 16 | pp 512 | 56.28 ± 0.08 |

Q4_0 is much faster than the other legacy quants thanks to your tinyBLAS.

Here is what I get for k-quants

| model | size | params | test | t/s (master) | t/s (optimized) | speedup |
|---|---|---|---|---|---|---|
| llama 7B Q2_K - Small | 2.16 GiB | 6.74 B | pp 512 | 116.35 ± 0.08 | 162.66 ± 1.21 | 1.398 ± 0.009 |
| llama 7B Q2_K - Medium | 2.36 GiB | 6.74 B | pp 512 | 100.49 ± 0.15 | 149.31 ± 0.48 | 1.486 ± 0.006 |
| llama 7B Q3_K - Small | 2.75 GiB | 6.74 B | pp 512 | 82.01 ± 0.11 | 132.68 ± 0.21 | 1.618 ± 0.003 |
| llama 7B Q3_K - Medium | 3.07 GiB | 6.74 B | pp 512 | 89.16 ± 0.10 | 136.53 ± 0.19 | 1.531 ± 0.003 |
| llama 7B Q4_K - Small | 3.59 GiB | 6.74 B | pp 512 | 104.45 ± 0.15 | 144.87 ± 0.31 | 1.387 ± 0.003 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | pp 512 | 101.50 ± 0.28 | 145.77 ± 0.33 | 1.436 ± 0.005 |
| llama 7B Q5_K - Small | 4.33 GiB | 6.74 B | pp 512 | 73.72 ± 0.29 | 124.05 ± 0.14 | 1.682 ± 0.007 |
| llama 7B Q5_K - Medium | 4.45 GiB | 6.74 B | pp 512 | 74.84 ± 0.13 | 126.52 ± 0.14 | 1.691 ± 0.004 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | pp 512 | 81.38 ± 0.09 | 146.30 ± 0.15 | 1.798 ± 0.003 |

Your favorite Q5_K_M quants are faster than Q4_0 with tinyBLAS after these changes :-)

There are 3 ingredients involved in this speedup:

  • SIMD-ify Q8_K quantization. Quantization of activations is single-threaded in ggml. Quantization to Q8_0, needed by the legacy quants, is already SIMD-ified, but quantization to Q8_K, required by k-quants, is not. This does not matter for token generation, but it gives a 5-6% speedup for prompt processing.
  • Further tweak the k-quant dot-product kernels to reduce or eliminate dependencies between computational steps. I guess this is what you call "instruction-level parallelism".
  • Last, but certainly not least, make use of the ability to do 2x2 matrix multiplications (rather than just dot products) that is already available in ggml (two weight rows times two activation columns). This gives a 20-40% speedup, depending on how costly it is to set up the bits as needed for the multiplication with the Q8_K quants. (A sketch of the 2x2 idea follows at the end of this comment.)

We see Q4_K and Q5_K being ~2.2 times faster than their respective legacy counterparts Q4_1 and Q5_1.
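To make the third ingredient concrete, here is a hedged sketch of the 2x2 idea (two weight rows times two activation columns) written over plain floats; the real kernels operate on Q5_K/Q6_K blocks against Q8_K activations, so the names and layout below are illustrative only. The point is that the work of unpacking one weight row gets reused against two activation columns instead of one.

```cpp
// Illustrative 2x2 multiplication: 2 weight rows x 2 activation columns.
// Unpacking/dequantizing a weight row is the expensive part for k-quants,
// so reusing each unpacked row against two activation columns amortizes it.
#include <cstddef>

void mul_2x2(size_t k,
             const float *w0, const float *w1,   // two weight rows, length k
             const float *x0, const float *x1,   // two activation columns, length k
             float out[2][2]) {
    float s00 = 0, s01 = 0, s10 = 0, s11 = 0;    // four independent accumulators
    for (size_t p = 0; p < k; ++p) {
        // In the real code, w0[p]/w1[p] would be produced by unpacking quantized
        // bits once per row; here they are plain floats for clarity.
        s00 += w0[p] * x0[p];
        s01 += w0[p] * x1[p];
        s10 += w1[p] * x0[p];
        s11 += w1[p] * x1[p];
    }
    out[0][0] = s00; out[0][1] = s01;
    out[1][0] = s10; out[1][1] = s11;
}
```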

@jart jart force-pushed the moe branch 2 times, most recently from f1a134a to c34c472 on April 26, 2024 15:47
@jart
Contributor Author

jart commented Apr 26, 2024

That's outstanding news @ikawrakow! I can't wait to see your code. Now I won't need to recommend the legacy quantization formats. Am I correct in understanding you used .nrows? Have you tried copying your optimized code into a tinyBLAS kernel? If you do that, then K quants might be able to surpass F16 and BF16 at evaluation.

@ikawrakow
Contributor

@jart

Yes, I used .nrows. I did that because it was not obvious to me how I could plug this into tinyBLAS. I could make a PR to your repository so you can do the integration into tinyBLAS yourself?

@jart
Contributor Author

jart commented Apr 26, 2024

@ikawrakow Receiving a PR from you would honor the llamafile project. What you'd want to do is create a copy of tinyBLAS_Q0_ARM named something like tinyBLAS_K5_ARM and then have your vec_dot code replace these specific lines. You'd then tune its mnpack() method to use smaller tiles until eventually there's no stack spillage.

Contributor

github-actions bot commented Apr 26, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 570 iterations 🚀

Details:
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8180.47ms p(95)=19678.15ms fails=, finish reason: stop=511 truncated=59
  • Prompt processing (pp): avg=93.49tk/s p(95)=393.96tk/s
  • Token generation (tg): avg=34.01tk/s p(95)=49.86tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=moe commit=bb3a5274c7c1efd883f7e57edb849c0394d2c91d

(Benchmark charts omitted: the bot comment includes time-series plots of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for the 10-minute, 570-iteration run.)

@lemmi

lemmi commented May 12, 2024

Here's a benchmark of an AMD V3C48 (a Zen 3 part) with Mistral 7B Instruct v0.2. (I had to throw out some code that used the X86_HAVE(F16C) check to make it compile, though.)

PR:

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| mistral 7B BF16 | 13.49 GiB | 7.24 B | CPU | 6 | pp512 | 23.25 ± 0.17 |
| mistral 7B BF16 | 13.49 GiB | 7.24 B | CPU | 6 | tg128 | 3.04 ± 0.01 |
| mistral 7B F16 | 13.49 GiB | 7.24 B | CPU | 6 | pp512 | 23.10 ± 0.10 |
| mistral 7B F16 | 13.49 GiB | 7.24 B | CPU | 6 | tg128 | 3.02 ± 0.01 |
| mistral 7B all F32 | 26.98 GiB | 7.24 B | CPU | 6 | pp512 | 20.00 ± 0.06 |
| mistral 7B all F32 | 26.98 GiB | 7.24 B | CPU | 6 | tg128 | 1.52 ± 0.01 |

Without these changes, prompt processing for BF16 clocks in at about 11 t/s (see #7182); the rest stays the same. Good improvement overall :)

(I'm still a bit confused as to why F16 performs so much better than BF16 without tinyBLAS, and whether there is still something left on the table, but at least this way there is no compromise in using BF16 now.)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@jart
Contributor Author

jart commented May 12, 2024

@lemmi Where you're going to see the biggest changes here is running Mixtral (rather than Mistral), because MoE models use MUL_MAT_ID (that's where they spend most of their clock cycles), and until this change we had no BLAS support whatsoever for MUL_MAT_ID. As for BF16 vs. F16: this change introduces tinyBLAS support for BF16 (a very recently introduced data type), so finally having BLAS-like performance for BF16 naturally helps it catch up with F16, and then surpass it on znver4.
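A quick aside on why BF16 is cheap to handle on CPU once a fast matmul path exists: decoding brain16 to float32 is just a 16-bit shift into the upper half of the float's bit pattern. A hedged sketch follows; the helper name bf16_to_f32 is illustrative and not the ggml function.

```cpp
// Illustrative brain16 -> float32 decode: bf16 stores the top 16 bits of an
// IEEE float32, so widening it is a shift plus a bit-cast (decode is exact).
#include <cstdint>
#include <cstring>

static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;  // place the 16 stored bits in the high half
    float f;
    std::memcpy(&f, &bits, sizeof f);   // bit-cast without violating aliasing rules
    return f;
}
```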

@ggerganov
Owner

Take a look at the failing CI run: https://github.com/ggerganov/llama.cpp/actions/runs/9052429493/job/24870086560?pr=6840

D:\a\llama.cpp\llama.cpp\sgemm.cpp(827,59): error C3861: 'MM256_SET_M128I': identifier not found [D:\a\llama.cpp\llama.cpp\build\ggml.vcxproj]

@ggerganov ggerganov added the "merging soon" label (Will merge soon unless anyone objects) on May 15, 2024
@ggerganov
Owner

@jart I think the following patch should fix the CI:

diff --git a/ggml-impl.h b/ggml-impl.h
index d85b152b..85d3f23f 100644
--- a/ggml-impl.h
+++ b/ggml-impl.h
@@ -17,6 +17,9 @@
 #define MIN(a, b) ((a) < (b) ? (a) : (b))
 #define MAX(a, b) ((a) > (b) ? (a) : (b))
 
+// some compilers don't provide _mm256_set_m128i, e.g. gcc 7
+#define MM256_SET_M128I(a, b) _mm256_insertf128_si256(_mm256_castsi128_si256(b), (a), 1)
+
 /**
  * Converts brain16 to float32.
  *
diff --git a/ggml-quants.c b/ggml-quants.c
index 00334c5f..3677b2db 100644
--- a/ggml-quants.c
+++ b/ggml-quants.c
@@ -22,9 +22,6 @@
 
 #define UNUSED GGML_UNUSED
 
-// some compilers don't provide _mm256_set_m128i, e.g. gcc 7
-#define MM256_SET_M128I(a, b) _mm256_insertf128_si256(_mm256_castsi128_si256(b), (a), 1)
-
 #if defined(__AVX__) || defined(__AVX2__) || defined(__AVX512F__) || defined(__SSSE3__)
 // multiply int8_t, add results pairwise twice
 static inline __m128i mul_sum_i8_pairs(const __m128i x, const __m128i y) {
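For clarity, the macro being moved simply builds a 256-bit vector from two 128-bit halves, the same way _mm256_set_m128i does on compilers that provide it. A small hedged usage sketch (variable and function names are illustrative; compile with AVX enabled):

```cpp
// MM256_SET_M128I(a, b) places b in the low 128 bits and a in the high 128 bits.
#include <immintrin.h>

#ifndef MM256_SET_M128I
#define MM256_SET_M128I(a, b) _mm256_insertf128_si256(_mm256_castsi128_si256(b), (a), 1)
#endif

__m256i combine_example(void) {
    __m128i lo = _mm_set1_epi8(1);   // will occupy the low lane
    __m128i hi = _mm_set1_epi8(2);   // will occupy the high lane
    return MM256_SET_M128I(hi, lo);
}
```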
