Skip to content

fixed typo#178

Merged
ggerganov merged 1 commit intoggml-org:masterfrom
moritzbrantner:master
Mar 15, 2023
Merged

fixed typo#178
ggerganov merged 1 commit intoggml-org:masterfrom
moritzbrantner:master

Conversation

@moritzbrantner
Copy link
Contributor

No description provided.

@ggerganov ggerganov merged commit 27944c4 into ggml-org:master Mar 15, 2023
@gjmulder gjmulder added the bug Something isn't working label Mar 15, 2023
blackhole89 pushed a commit that referenced this pull request Mar 15, 2023
rooprob pushed a commit to rooprob/llama.cpp that referenced this pull request Aug 2, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023
…quency-penalty

Document presence frequency penalty
SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
* Try interleaving 8 rows for iq4_xs

On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B.
TG-128 reaches max. performance at 2 threads and is slightly
higher than 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads
and 14/28 t/s @ 4 threads).

* Try interleaving 8 iq4_xs rows

It is also faster on AVX2.

This is the NEON implementation. It is tiny bit faster than
4 interleaved rows (~0.5%).

So, this looks like a winner given the Zen4/AVX2 improvement
without associated NEON egression.

* Cleanup

* 8-rows interleaved q8_0 (AVX2)

* 8-rows interleaved q8_0 (Zen4)

* 8-rows interleaved q8_0 (Zen4) - slightly better

PP-512 is now 284 t/s compared to 257 t/s for 4-rows interleaved.
TG-128 reaches peak of 8.16 t/s at just 2 threads compared
to 7.95 t/s @ 4 threads before.

* 8-rows interleaved q8_0 (NEON)

PP-512 is slightly better (138 t/s vs 132.5 t/s), TG-128 is about the
same.

* FA: repack Q8_0 to Q8_0_R8

* Remove special purpose mul_mat_q8_0_r4_q8_1_128 (Zen4)

* FA: repack Q8_0 to Q8_0_R8 (NEON)

Very slightly faster than the general purpose gemm, slightly
slower than the D = 128 special case gemm mul_mat_q8_0_r4_q8_0_128.
Still removing mul_mat_q8_0_r4_q8_0_128 as we simply don't have
enough vector registers to hold 8 interleaved rows, so there is
no point to have the special purpose implementation.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants