I was interested in what impact the BF16 format (#6412) has on my CPU (AMD Ryzen Embedded V3000 V3C48, which uses Zen 3 cores, with 4800 MT/s ECC RAM).
Surprisingly, prompt processing reaches only about half the performance of the F16 and F32 formats, while token generation is slightly faster.
Here is `llama-bench` with sgemm:
| model              |      size | params | backend | threads |   test |          t/s |
| ------------------ | --------: | -----: | ------- | ------: | -----: | -----------: |
| mistral 7B BF16    | 13.49 GiB | 7.24 B | CPU     |       6 | pp 512 | 11.18 ± 0.05 |
| mistral 7B BF16    | 13.49 GiB | 7.24 B | CPU     |       6 | tg 128 |  3.26 ± 0.03 |
| mistral 7B F16     | 13.49 GiB | 7.24 B | CPU     |       6 | pp 512 | 23.29 ± 0.07 |
| mistral 7B F16     | 13.49 GiB | 7.24 B | CPU     |       6 | tg 128 |  3.02 ± 0.03 |
| mistral 7B all F32 | 26.98 GiB | 7.24 B | CPU     |       6 | pp 512 | 19.94 ± 0.04 |
| mistral 7B all F32 | 26.98 GiB | 7.24 B | CPU     |       6 | tg 128 |  1.52 ± 0.01 |
And here without sgemm (built with `LLAMA_NO_LLAMAFILE=1 make ...`):
| model              |      size | params | backend | threads |   test |          t/s |
| ------------------ | --------: | -----: | ------- | ------: | -----: | -----------: |
| mistral 7B BF16    | 13.49 GiB | 7.24 B | CPU     |       6 | pp 512 | 10.80 ± 0.04 |
| mistral 7B BF16    | 13.49 GiB | 7.24 B | CPU     |       6 | tg 128 |  3.24 ± 0.02 |
| mistral 7B F16     | 13.49 GiB | 7.24 B | CPU     |       6 | pp 512 | 16.32 ± 0.04 |
| mistral 7B F16     | 13.49 GiB | 7.24 B | CPU     |       6 | tg 128 |  3.26 ± 0.03 |
| mistral 7B all F32 | 26.98 GiB | 7.24 B | CPU     |       6 | pp 512 | 10.60 ± 0.05 |
| mistral 7B all F32 | 26.98 GiB | 7.24 B | CPU     |       6 | tg 128 |  1.65 ± 0.01 |
So the difference is not just down to the sgemm path.
Running `perf` on `llama-bench` with a `Mistral-7B-Instruct-v0.2` model showed that more than 95% of the time is spent in `ggml_vec_dot_bf16`. Here is the annotated disassembly:
The majority of the time is spent in the `vpmovzxwd` and `vpslld` instructions. My guess is that this has more to do with waiting for memory than anything else, since the same instructions seem to run quite fast in other locations.
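For context, a bf16 value is just the top 16 bits of an IEEE f32, so widening a vector of bf16 to f32 boils down to exactly those two instructions. A minimal AVX2 sketch of the conversion (simplified, not the exact ggml code):

```c
#include <immintrin.h>
#include <stdint.h>

// Widen 8 bf16 values (stored as uint16_t) to 8 f32 values.
// bf16 is the high half of an IEEE f32, so f32_bits = bf16_bits << 16.
static inline __m256 bf16_to_f32x8(const uint16_t *src) {
    __m128i raw  = _mm_loadu_si128((const __m128i *) src); // 8 x u16
    __m256i wide = _mm256_cvtepu16_epi32(raw);              // vpmovzxwd: zero-extend to u32
    __m256i bits = _mm256_slli_epi32(wide, 16);             // vpslld: shift into the high half
    return _mm256_castsi256_ps(bits);                       // reinterpret the bits as f32
}
```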
I toyed around with different amounts of unrolling and reordering of the instructions, but that did not really yield any improvements.
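For reference, this is roughly the shape of the loop I was working with (a simplified sketch building on the helper above; the actual `ggml_vec_dot_bf16` differs in details such as unroll factor and tail handling):

```c
// Simplified bf16 dot product; n is assumed to be a multiple of 8 here.
static float vec_dot_bf16_avx2(int n, const uint16_t *x, const uint16_t *y) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        // Two converted loads feed one fused multiply-add per iteration.
        acc = _mm256_fmadd_ps(bf16_to_f32x8(x + i), bf16_to_f32x8(y + i), acc);
    }
    // Horizontal sum of the 8 accumulator lanes.
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    return _mm_cvtss_f32(lo);
}
```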
Just for fun, here is the result of leaving vectorization of `ggml_vec_dot_bf16` to the compiler:
| model           |      size | params | backend | threads |   test |         t/s |
| --------------- | --------: | -----: | ------- | ------: | -----: | ----------: |
| mistral 7B BF16 | 13.49 GiB | 7.24 B | CPU     |       6 | pp 512 | 4.24 ± 0.02 |
| mistral 7B BF16 | 13.49 GiB | 7.24 B | CPU     |       6 | tg 128 | 3.13 ± 0.02 |
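By "leaving vectorization to the compiler" I mean replacing the intrinsics with a plain scalar loop along these lines (a sketch of the idea, not the exact change):

```c
// Plain scalar reference; the compiler is left to auto-vectorize it.
static float vec_dot_bf16_scalar(int n, const uint16_t *x, const uint16_t *y) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        // Same bit trick as above: place the bf16 bits in the high half of an f32.
        union { uint32_t u; float f; } a = { (uint32_t) x[i] << 16 };
        union { uint32_t u; float f; } b = { (uint32_t) y[i] << 16 };
        sum += a.f * b.f;
    }
    return sum;
}
```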
While the intrinsic version does seem to help in prompt processing, the impact for token generation is rather small for some reason.
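A back-of-envelope check is consistent with token generation being memory-bound rather than compute-bound: at ~3.26 t/s the 13.49 GiB of BF16 weights amount to roughly 44 GiB/s of reads per second, and the F16 and F32 runs land in the same ballpark (roughly 41 and 45 GiB/s), which is a sizeable fraction of what dual-channel DDR5-4800 can deliver in practice (assuming that is indeed the memory configuration here).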
Here is the profile of the loop: