Observations: Clang does not like llama.cpp's fp16/Q8_0 paths, at least on my CPU (EPYC 7F72). With a stock make, the Clang build is about 0.08 t/s slower at inference and roughly 8 t/s slower at prompt processing. Building with -Ofast (LLAMA_FAST=1) brings inference speed back up to match GCC. With K quants, however, Clang is faster by a respectable margin; see the K quant test section below. In short: plain Q8_0/fp16 is faster with GCC, but K quants are faster with Clang.
Note: For GCC, building with or without LLAMA_FAST made no significant difference, only about 0.1 t/s faster prompt processing, which is within the margin of error.
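For anyone reproducing this, each configuration below is a from-scratch rebuild followed by the same llama-bench runs. Roughly (a sketch; the make clean between builds is my addition, and -O3 vs -Ofast is what LLAMA_FAST toggles):

make clean && make -j                                                 # GCC 12, default -O3
make clean && make DEFCC=clang-18 DEFCXX=clang++-18 -j                # Clang 18, default -O3
make clean && make DEFCC=clang-18 DEFCXX=clang++-18 LLAMA_FAST=1 -j   # Clang 18 with -Ofast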
./llama-bench -m /mnt/36TB/AI/llama-3-8B-Instruct-abliterated/ggml-model-f16.gguf -t 24 -r 5 -pg 512,128
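# -t 24: 24 threads; -r 5: 5 repetitions per test; -pg 512,128: combined prompt + generation test (512-token prompt, 128 generated tokens)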
Clang 18 (make DEFCC=clang-18 DEFCXX=clang++-18 -j):
With LLAMA_FAST=1 (i.e. -Ofast instead of -O3):
GCC 12 (make -j):
Clang 18:
./llama-bench -m /mnt/36TB/AI/llama-3-8B-Instruct-abliterated/ggml-model-f16.gguf -t 24 -r 5 -p 64,128,265,512,768,1024 -n 0
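# -p 64,...,1024: prompt sizes to sweep; -n 0: no token generation, so this measures prompt processing only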
GCC 12:
./llama-bench -m /mnt/36TB/AI/llama-3-8B-Instruct-abliterated/ggml-model-f16.gguf -t 24 -r 5 -p 64,128,265,512,768,1024 -n 0
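Side note: to line the two compilers up more directly, llama-bench can emit machine-readable output via -o (csv/json/md). A sketch, with pp_gcc12.csv / pp_clang18.csv as example filenames, each run against the corresponding build:

./llama-bench -m /mnt/36TB/AI/llama-3-8B-Instruct-abliterated/ggml-model-f16.gguf -t 24 -r 5 -p 64,128,265,512,768,1024 -n 0 -o csv > pp_gcc12.csv
./llama-bench -m /mnt/36TB/AI/llama-3-8B-Instruct-abliterated/ggml-model-f16.gguf -t 24 -r 5 -p 64,128,265,512,768,1024 -n 0 -o csv > pp_clang18.csv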
################ K Quant TEST ################
GCC 12:
./llama-bench -m /mnt/36TB/AI/llama-3-70B-Instruct-abliterated/ggml-model-Q6_K.gguf -t 24 -r 3 -p 64 -n 64
./llama-bench -m /mnt/36TB/AI/llama-3-70B-Instruct-abliterated/ggml-model-Q8_0 -t 24 -r 3 -p 64 -n 64
./llama-bench -m /mnt/36TB/AI/llama-3-70B-Instruct-abliterated/ggml-model-fp16.gguf -t 24 -r 3 -p 64 -n 64
Clang 18:
./llama-bench -m /mnt/36TB/AI/llama-3-70B-Instruct-abliterated/ggml-model-Q6_K.gguf -t 24 -r 3 -p 64 -n 64
./llama-bench -m /mnt/36TB/AI/llama-3-70B-Instruct-abliterated/ggml-model-Q8_0 -t 24 -r 3 -p 64 -n 64
./llama-bench -m /mnt/36TB/AI/llama-3-70B-Instruct-abliterated/ggml-model-fp16.gguf -t 24 -r 3 -p 64 -n 32
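The three 70B runs per compiler can also be looped in one go (a sketch; it assumes all three files end in .gguf and uses -n 64 throughout, unlike the -n 32 in the last Clang run above):

for q in Q6_K Q8_0 fp16; do
  ./llama-bench -m /mnt/36TB/AI/llama-3-70B-Instruct-abliterated/ggml-model-${q}.gguf -t 24 -r 3 -p 64 -n 64
done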