Closed
Labels
bug-unconfirmed, medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features that are still usable), stale
Description
What happened?
Performance drops when a quantized kv cache is combined with flash attention. Both pp and tg are about one third slower when flash attention is enabled and at least one of k, v is quantized.
Here are some benchmark results on an iPhone 14 (A15) with Metal. I observed similar results on my M1 Mac with llama-bench.
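For anyone who wants to try reproducing this on a Mac, here is a rough sketch that builds one llama-bench command line per configuration listed in this report. It only prints the commands and does not run them. The model path is a hypothetical placeholder, and the binary location `./llama-bench` is an assumption about the local build; the `-p`, `-n`, `-ctk`, `-ctv` and `-fa` options are the llama-bench flags for prompt length, generation length, K cache type, V cache type and flash attention.

```python
import shlex

# Hypothetical model path; replace with the GGUF file you actually benchmark.
MODEL = "models/phi3-q4_k_m.gguf"

# (K cache type, V cache type, flash attention) for each table in this report.
CONFIGS = [
    ("f16", "f16", 0),
    ("f16", "f16", 1),
    ("q8_0", "f16", 0),
    ("q8_0", "f16", 1),
    ("f16", "q8_0", 1),
    ("q8_0", "q8_0", 1),
]


def bench_command(model, ctk, ctv, fa):
    """Build the argument list for one llama-bench run (pp 512, tg 128)."""
    return [
        "./llama-bench",  # assumed location of the built binary
        "-m", model,
        "-p", "512",
        "-n", "128",
        "-ctk", ctk,
        "-ctv", ctv,
        "-fa", str(fa),
    ]


commands = [bench_command(MODEL, ctk, ctv, fa) for ctk, ctv, fa in CONFIGS]

for cmd in commands:
    # Print a copy-pasteable shell line for each configuration.
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to `subprocess.run` to collect the output automatically.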
FP16 kv cache
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 114.81 ± 12.26 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.72 ± 0.04 |
FP16 kv cache with flash attention
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 126.93 ± 7.30 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.95 ± 0.04 |
q8_0 k, fp16 v
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 117.71 ± 10.46 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.81 ± 0.04 |
q8_0 k, fp16 v with flash attention
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 85.58 ± 1.93 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.77 ± 0.01 |
fp16 k, q8_0 v with flash attention
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 77.47 ± 3.22 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.69 ± 0.00 |
q8_0 k, q8_0 v with flash attention
| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 80.99 ± 1.72 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.78 ± 0.10 |
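To make the "about one third slower" figure concrete, here is a small sketch that computes the relative slowdown of each configuration against the FP16 kv cache with flash attention. The numbers are the mean t/s values copied from the tables above (the ± spread is ignored), the dictionary keys are just illustrative labels, and the fifth table is read as fp16 k with q8_0 v.

```python
# Mean throughput in tokens/s from the tables above: (pp 512, tg 128).
results = {
    "f16 k, f16 v, no FA": (114.81, 9.72),
    "f16 k, f16 v, FA": (126.93, 9.95),
    "q8_0 k, f16 v, no FA": (117.71, 9.81),
    "q8_0 k, f16 v, FA": (85.58, 6.77),
    "f16 k, q8_0 v, FA": (77.47, 6.69),
    "q8_0 k, q8_0 v, FA": (80.99, 6.78),
}


def slowdown_percent(baseline, value):
    """Relative drop in throughput versus the baseline, in percent."""
    return (baseline - value) / baseline * 100.0


# Baseline: FP16 kv cache with flash attention enabled.
base_pp, base_tg = results["f16 k, f16 v, FA"]

slowdowns = {
    name: (slowdown_percent(base_pp, pp), slowdown_percent(base_tg, tg))
    for name, (pp, tg) in results.items()
}

for name, (pp_drop, tg_drop) in slowdowns.items():
    print(f"{name:22s} pp: {pp_drop:6.1f}%  tg: {tg_drop:6.1f}%")
```

With these inputs, the three quantized flash attention runs come out roughly 33% to 39% slower for pp and about 32% to 33% slower for tg than the FP16 flash attention baseline, while the q8_0 k run without flash attention stays within a few percent of the FP16 runs.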
Name and Version
./llama.swiftui, version 3488 (75af08c)
What operating system are you seeing the problem on?
Mac
Relevant log output
No response