BF16 prompt processing has half the performance of F16 and F32 on AMD Ryzen Embedded V3000 (Zen 3) #7182

Open · lemmi opened this issue May 9, 2024 · 1 comment


lemmi commented May 9, 2024

I was interested in what impact the BF16 format (#6412) has on my CPU (an AMD Ryzen Embedded V3000 V3C48, which uses Zen 3 cores, with 4800 MT/s ECC RAM).
Surprisingly, prompt processing is only about half as fast as with the F16 and F32 formats, while token generation is slightly faster.

Here is `llama-bench` with sgemm:

| model              |      size | params | backend | threads |   test |          t/s |
| ------------------ | --------: | -----: | ------- | ------: | -----: | -----------: |
| mistral 7B BF16    | 13.49 GiB | 7.24 B | CPU     |       6 | pp 512 | 11.18 ± 0.05 |
| mistral 7B BF16    | 13.49 GiB | 7.24 B | CPU     |       6 | tg 128 |  3.26 ± 0.03 |
| mistral 7B F16     | 13.49 GiB | 7.24 B | CPU     |       6 | pp 512 | 23.29 ± 0.07 |
| mistral 7B F16     | 13.49 GiB | 7.24 B | CPU     |       6 | tg 128 |  3.02 ± 0.03 |
| mistral 7B all F32 | 26.98 GiB | 7.24 B | CPU     |       6 | pp 512 | 19.94 ± 0.04 |
| mistral 7B all F32 | 26.98 GiB | 7.24 B | CPU     |       6 | tg 128 |  1.52 ± 0.01 |

And here without sgemm (built with `LLAMA_NO_LLAMAFILE=1 make ...`):

| model              |      size | params | backend | threads |   test |          t/s |
| ------------------ | --------: | -----: | ------- | ------: | -----: | -----------: |
| mistral 7B BF16    | 13.49 GiB | 7.24 B | CPU     |       6 | pp 512 | 10.80 ± 0.04 |
| mistral 7B BF16    | 13.49 GiB | 7.24 B | CPU     |       6 | tg 128 |  3.24 ± 0.02 |
| mistral 7B F16     | 13.49 GiB | 7.24 B | CPU     |       6 | pp 512 | 16.32 ± 0.04 |
| mistral 7B F16     | 13.49 GiB | 7.24 B | CPU     |       6 | tg 128 |  3.26 ± 0.03 |
| mistral 7B all F32 | 26.98 GiB | 7.24 B | CPU     |       6 | pp 512 | 10.60 ± 0.05 |
| mistral 7B all F32 | 26.98 GiB | 7.24 B | CPU     |       6 | tg 128 |  1.65 ± 0.01 |

So it's not just gains from sgemm.

Running `perf` on `llama-bench` with a Mistral-7B-Instruct-v0.2 model shows that more than 95% of the time is spent in `ggml_vec_dot_bf16`. Here is the annotated disassembly:

```
Percent│
       │
       │3    Disassembly of section .text:
       │
       │5    000000000002df60 <ggml_vec_dot_bf16>:
       │6    ggml_vec_dot_bf16():
  0.05 │       cmp          $0x1f,%edi
  0.01 │     ↓ jle          410
  0.00 │       lea          -0x20(%rdi),%r8d
       │       vxorps       %xmm3,%xmm3,%xmm3
  0.00 │       mov          %r9,%rax
  0.00 │       mov          %rcx,%rdx
  0.00 │       shr          $0x5,%r8d
  0.00 │       vmovaps      %ymm3,%ymm4
  0.05 │       vmovaps      %ymm3,%ymm5
  0.00 │       vmovaps      %ymm3,%ymm2
  0.00 │       mov          %r8d,%r10d
       │       shl          $0x6,%r10
  0.00 │       lea          0x40(%r9,%r10,1),%r10
  0.00 │       data16       cs nopw 0x0(%rax,%rax,1)
  0.00 │       xchg         %ax,%ax
  0.05 │ 40:   vpmovzxwd    (%rax),%ymm0
  0.53 │       vpmovzxwd    (%rdx),%ymm1
  0.40 │       add          $0x40,%rax
  0.05 │       add          $0x40,%rdx
  0.04 │       vpslld       $0x10,%ymm0,%ymm0
  0.25 │       vpslld       $0x10,%ymm1,%ymm1
  0.04 │       vfmadd231ps  %ymm0,%ymm1,%ymm2
 14.70 │       vpmovzxwd    -0x30(%rax),%ymm1
  0.08 │       vpmovzxwd    -0x30(%rdx),%ymm0
  0.07 │       vpslld       $0x10,%ymm1,%ymm1
  0.20 │       vpslld       $0x10,%ymm0,%ymm0
  0.06 │       vfmadd231ps  %ymm0,%ymm1,%ymm5
 14.98 │       vpmovzxwd    -0x20(%rax),%ymm1
  0.07 │       vpmovzxwd    -0x20(%rdx),%ymm0
 34.90 │       vpslld       $0x10,%ymm1,%ymm1
  0.50 │       vpslld       $0x10,%ymm0,%ymm0
  0.36 │       vfmadd231ps  %ymm0,%ymm1,%ymm4
 15.21 │       vpmovzxwd    -0x10(%rax),%ymm1
  0.14 │       vpmovzxwd    -0x10(%rdx),%ymm0
  0.13 │       vpslld       $0x10,%ymm1,%ymm1
  0.44 │       vpslld       $0x10,%ymm0,%ymm0
  0.05 │       vfmadd231ps  %ymm0,%ymm1,%ymm3
 15.01 │       cmp          %rax,%r10
  0.04 │     ↑ jne          40
       │       vaddps       %ymm5,%ymm2,%ymm0
  0.06 │       vaddps       %ymm4,%ymm3,%ymm3
  0.11 │       lea          0x1(%r8),%edx
  0.00 │       shl          $0x5,%edx
  0.00 │       mov          %edx,%r8d
       │       vaddps       %ymm3,%ymm0,%ymm0
  0.14 │ cd:   vmovaps      %xmm0,%xmm1
  0.00 │       vextractf128 $0x1,%ymm0,%xmm0
  0.18 │       vaddps       %xmm1,%xmm0,%xmm0
  0.16 │       vmovhlps     %xmm0,%xmm0,%xmm1
  0.18 │       vaddps       %xmm0,%xmm1,%xmm0
  0.19 │       vmovshdup    %xmm0,%xmm1
  0.18 │       vaddss       %xmm1,%xmm0,%xmm0
  0.17 │       vcvtss2sd    %xmm0,%xmm0,%xmm4
  0.17 │       cmp          %r8d,%edi
  0.00 │     ↓ jle          400
```

The majority of the time is spent in the `vpmovzxwd` and `vpslld` instructions. My guess is that this has more to do with waiting for memory than anything else, since the same instructions appear to run quite fast in other locations.
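
For reference, the inner loop above corresponds roughly to the AVX2 intrinsics below (a minimal sketch of the same idea, not the actual ggml source; the real loop is additionally unrolled four times with four separate accumulators). Each bf16 value is zero-extended to 32 bits (`vpmovzxwd`), shifted left by 16 so its bits land in the upper half of an f32 (`vpslld $0x10`), and then fed into an FMA (`vfmadd231ps`):

```c
#include <immintrin.h>
#include <stdint.h>

// Widen 8 bf16 values to 8 f32 values: zero-extend to 32 bits, then shift
// the payload into the upper half of the float (bf16 is just the top 16
// bits of an IEEE-754 binary32).
static inline __m256 bf16x8_to_f32x8(const uint16_t *p) {
    __m256i v = _mm256_cvtepu16_epi32(_mm_loadu_si128((const __m128i *)p));
    return _mm256_castsi256_ps(_mm256_slli_epi32(v, 16));
}

// Sketch of the dot-product kernel (8 elements per iteration, one accumulator).
float dot_bf16_avx2(int n, const uint16_t *x, const uint16_t *y) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_fmadd_ps(bf16x8_to_f32x8(x + i), bf16x8_to_f32x8(y + i), acc);
    }
    // Horizontal sum of the 8 partial sums.
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_movehdup_ps(s));
    float sum = _mm_cvtss_f32(s);
    // Scalar tail for leftover elements.
    for (; i < n; i++) {
        union { uint32_t u; float f; } a = { (uint32_t)x[i] << 16 };
        union { uint32_t u; float f; } b = { (uint32_t)y[i] << 16 };
        sum += a.f * b.f;
    }
    return sum;
}
```

Compared to the F16 path, which presumably loads and converts in a single `vcvtph2ps`, this needs an extra shift per vector, though as noted above the loop may simply be limited by memory.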

I toyed around with different amounts of unrolling and reordering of the instructions, but that did not really yield any improvements.


Just for fun, here is the result of leaving the vectorization of `ggml_vec_dot_bf16` to the compiler:

| model              |      size | params | backend | threads |   test |          t/s |
| ------------------ | --------: | -----: | ------- | ------: | -----: | -----------: |
| mistral 7B BF16    | 13.49 GiB | 7.24 B | CPU     |       6 | pp 512 |  4.24 ± 0.02 |
| mistral 7B BF16    | 13.49 GiB | 7.24 B | CPU     |       6 | tg 128 |  3.13 ± 0.02 |

While the intrinsics version clearly helps with prompt processing, its impact on token generation is rather small for some reason.
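
For context, what the compiler is vectorizing here is roughly the scalar loop below (a sketch, not the exact ggml source; I'm assuming the accumulator is a double, which matches the `vcvtps2pd`/`vaddpd` pairs in the profile below). The per-element f32→f64 widening and the dependent double adds are where a good chunk of the time goes:

```c
#include <stdint.h>
#include <string.h>

// bf16 -> f32: shift the 16 payload bits into the top half of a binary32.
static inline float bf16_to_f32(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

// Plain scalar dot product with a double accumulator, left to the auto-vectorizer.
float dot_bf16_scalar(int n, const uint16_t *x, const uint16_t *y) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += (double)(bf16_to_f32(x[i]) * bf16_to_f32(y[i]));
    }
    return (float)sum;
}
```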

Here is the profile of the loop:

```
  0.00 │ 30:   vpmovzxwd    (%rcx,%rax,1),%ymm1
 19.45 │       vpmovzxwd    (%r9,%rax,1),%ymm0
  0.04 │       vmovdqu      (%rcx,%rax,1),%ymm5
  0.00 │       vmovdqu      (%r9,%rax,1),%ymm6
  0.33 │       add          $0x20,%rax
  0.00 │       vpslld       $0x10,%ymm0,%ymm0
  0.01 │       vpslld       $0x10,%ymm1,%ymm1
  6.67 │       vmulps       %ymm0,%ymm1,%ymm1
  0.54 │       vextracti128 $0x1,%ymm6,%xmm2
  0.06 │       vextracti128 $0x1,%ymm5,%xmm0
  0.01 │       vpmovzxwd    %xmm0,%ymm0
  0.09 │       vpmovzxwd    %xmm2,%ymm2
  0.09 │       vpslld       $0x10,%ymm2,%ymm2
  6.59 │       vpslld       $0x10,%ymm0,%ymm0
  0.11 │       vmulps       %ymm2,%ymm0,%ymm0
  1.17 │       vcvtps2pd    %xmm1,%ymm2
  0.01 │       vextractf128 $0x1,%ymm1,%xmm1
  0.00 │       vcvtps2pd    %xmm1,%ymm1
  0.18 │       vaddpd       %ymm1,%ymm2,%ymm1
  6.81 │       vcvtps2pd    %xmm0,%ymm2
  1.26 │       vextractf128 $0x1,%ymm0,%xmm0
  0.47 │       vcvtps2pd    %xmm0,%ymm0
  8.89 │       vaddpd       %ymm0,%ymm2,%ymm0
 11.95 │       vaddpd       %ymm0,%ymm1,%ymm0
 15.98 │       vaddpd       %ymm0,%ymm3,%ymm3
 18.85 │       cmp          %rax,%rdx
  0.01 │     ↑ jne          30
```

jart (Contributor) commented May 11, 2024

That's because the tinyBLAS kernels for BF16 haven't been added yet. I've included them in #6840, which should fix this once merged.
