Bug: Quantized kv cache caused performance drop on Apple silicon #8918

### What happened?

I observed a performance drop when a quantized KV cache is combined with flash attention. Both prompt processing (pp) and text generation (tg) are roughly 1/3 slower when flash attention is enabled and at least one of K or V is quantized.
Here are some benchmark results on an iPhone 14 (A15) with the Metal backend. I observed similar results on my M1 Mac with llama-bench too.
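
For anyone trying to reproduce this on a Mac, the six configurations below can be expressed as `llama-bench` invocations. This is only a sketch: the model path is a hypothetical placeholder, and the exact commands used for the reported numbers are not recorded in this issue. It assumes the `-ctk`, `-ctv`, `-fa`, `-p` and `-n` options of `llama-bench`, and it builds and prints the command lines without running them.

```python
import shlex

# Hypothetical model path; substitute the GGUF file actually being benchmarked.
MODEL = "models/phi-3-mini-q4_k_m.gguf"

# (K cache type, V cache type, flash attention) for the six runs in this report.
# The quantized V cache runs are only listed with flash attention enabled.
CONFIGS = [
    ("f16", "f16", 0),
    ("f16", "f16", 1),
    ("q8_0", "f16", 0),
    ("q8_0", "f16", 1),
    ("f16", "q8_0", 1),
    ("q8_0", "q8_0", 1),
]


def bench_command(model, ctk, ctv, fa, n_prompt=512, n_gen=128):
    """Build one llama-bench argument list for a given KV cache setup."""
    return [
        "./llama-bench",
        "-m", model,
        "-p", str(n_prompt),   # prompt processing test size (pp 512)
        "-n", str(n_gen),      # text generation test size (tg 128)
        "-ctk", ctk,           # K cache type
        "-ctv", ctv,           # V cache type
        "-fa", str(fa),        # flash attention off (0) or on (1)
    ]


commands = [bench_command(MODEL, k, v, fa) for k, v, fa in CONFIGS]

# Print the commands so they can be pasted into a shell.
for cmd in commands:
    print(shlex.join(cmd))
```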
FP16 kv cache
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 114.81 ± 12.26 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.72 ± 0.04 |

FP16 kv cache with flash attention
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 126.93 ± 7.30 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.95 ± 0.04 |

q8_0 k, fp16 v
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 117.71 ± 10.46 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.81 ± 0.04 | 

q8_0 k, fp16 v with flash attention
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 85.58 ± 1.93 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.77 ± 0.01 |

fp16 k, q8_0 v with flash attention
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 77.47 ± 3.22 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.69 ± 0.00 |

q8_0 k, q8_0 v with flash attention
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 80.99 ± 1.72 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.78 ± 0.10 |
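
The "roughly 1/3 slower" figure can be checked directly against the tables above. The sketch below uses only the mean t/s values and assumes the baseline is the FP16 kv cache with flash attention; the names `BASELINE`, `QUANTIZED_FA` and `slowdown_pct` are just for illustration.

```python
# Mean throughput in tokens/s, copied from the tables above (error bars omitted).
BASELINE = {"pp 512": 126.93, "tg 128": 9.95}  # FP16 K and V, flash attention on

QUANTIZED_FA = {
    "q8_0 K, f16 V": {"pp 512": 85.58, "tg 128": 6.77},
    "f16 K, q8_0 V": {"pp 512": 77.47, "tg 128": 6.69},
    "q8_0 K, q8_0 V": {"pp 512": 80.99, "tg 128": 6.78},
}


def slowdown_pct(baseline, value):
    """Relative throughput loss versus the baseline, in percent."""
    return (baseline - value) / baseline * 100.0


# Slowdown of every quantized configuration for both tests.
slowdowns = {
    config: {test: slowdown_pct(BASELINE[test], tps) for test, tps in results.items()}
    for config, results in QUANTIZED_FA.items()
}

for config, per_test in slowdowns.items():
    for test, pct in per_test.items():
        print(f"{config:15s} {test}: {pct:.1f}% slower")
```

With these inputs every quantized configuration lands between about 32% and 39% below the FP16 flash attention baseline.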


### Name and Version

./llama.swiftui, version 3488 (75af08c47)

### What operating system are you seeing the problem on?

Mac

### Relevant log output

_No response_
