
[WIP] Improve performance on x86 #295

Closed
TheSteveHan wants to merge 1 commit

Conversation

TheSteveHan

Could someone please take over this pull request?

Unfortunately, I'm quite behind on a few other obligations, so I won't be able to continue exploring here. Feel free to take this as inspiration and make a production-ready version!


I did some initial exploration of various ways to squeeze more performance out of the main loop on my Ubuntu desktop with an i7-7700K CPU.

The code was compiled with gcc-10 and invoked with
./main -m ./models/7B/ggml-model-q4_0.bin -s 1679164839 -m ./models/7B/ggml-model-q4_0.bin -n 1280

Since inference is usually memory-bound, I specifically looked for ways to improve memory access.
It seems a combination of prefetching + CPU pinning + loop unrolling can improve performance by up to ~25%.
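Roughly, the pattern looks like this (a sketch using a hypothetical q4_0-style block layout, not ggml's actual structs; the distance of 32 blocks is the arbitrary value from this PR):

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical q4_0-style block: one fp32 scale plus 32 weights packed as
// 4-bit nibbles. ggml's real struct differs; this is illustration only.
typedef struct {
    float   d;
    uint8_t qs[16];
} block_q4;

// Prefetch a fixed distance ahead and process two blocks per iteration.
// nb is assumed even; prefetching past the end of the array is harmless.
static float dot_q4(const block_q4 *x, const block_q4 *y, int nb) {
    float sum = 0.0f;
    for (int i = 0; i < nb; i += 2) {
        _mm_prefetch((const char *)(x + i + 32), _MM_HINT_T0);
        _mm_prefetch((const char *)(y + i + 32), _MM_HINT_T0);
        for (int u = 0; u < 2; u++) { // manually unrolled pair
            const block_q4 *bx = &x[i + u];
            const block_q4 *by = &y[i + u];
            int isum = 0;
            for (int k = 0; k < 16; k++) {
                const int x0 = (bx->qs[k] & 0x0F) - 8;
                const int x1 = (bx->qs[k] >> 4)   - 8;
                const int y0 = (by->qs[k] & 0x0F) - 8;
                const int y1 = (by->qs[k] >> 4)   - 8;
                isum += x0*y0 + x1*y1;
            }
            sum += bx->d * by->d * (float)isum;
        }
    }
    return sum;
}
```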

[Screenshot: performance measurements, 2023-03-18]

These changes have only been tested on my machine, and I suspect the code won't even compile on other platforms.

@sw (Collaborator) commented Mar 19, 2023

This worked with gcc 11.3 but gave only a slight improvement. I've been looking at that function as well since it's clearly the hotspot. I was playing around with the AVX2 instructions but it seems to be pretty much memory-bound.

I tried using _mm256_maddubs_epi16 as described here, but didn't see a consistent improvement.
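For reference, the usual pattern with that instruction, since it multiplies unsigned bytes by signed bytes (a sketch of the standard trick, not my exact code):

```c
#include <immintrin.h>

// _mm256_maddubs_epi16 computes a*b for UNSIGNED bytes a and SIGNED bytes b,
// summing adjacent pairs into int16. For two signed int8 vectors x and y,
// move x's sign onto y first with _mm256_sign_epi8. No saturation risk for
// nibble-range values in [-8, 7]; full-range int8 can saturate the pair sums.
static inline __m256i mul_sum_i8_pairs(__m256i x, __m256i y) {
    const __m256i ax  = _mm256_sign_epi8(x, x);          // |x|, as unsigned
    const __m256i sy  = _mm256_sign_epi8(y, x);          // y * sign(x)
    const __m256i dot = _mm256_maddubs_epi16(ax, sy);    // 16 x int16 pair sums
    return _mm256_madd_epi16(dot, _mm256_set1_epi16(1)); // widen to 8 x int32
}
```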

@TheSteveHan (Author) commented Mar 19, 2023

> This worked with gcc 11.3 but gave only a slight improvement. I've been looking at that function as well since it's clearly the hotspot. I was playing around with the AVX2 instructions but it seems to be pretty much memory-bound.
>
> I tried using _mm256_maddubs_epi16 as described here, but didn't see a consistent improvement.

The hard-coded 32 in the prefetch distance is quite arbitrary; I wonder if a different value would work better on your machine.
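Something like this would make it easy to sweep (a hypothetical knob, not in the PR):

```c
// Hypothetical tuning macro; 32 was tuned only on an i7-7700K.
// Used as: _mm_prefetch((const char *)(x + i + PREFETCH_DIST), _MM_HINT_T0);
#ifndef PREFETCH_DIST
#define PREFETCH_DIST 32 // in blocks; worth sweeping e.g. 8..128 per machine
#endif
```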

As you approach the limits of your machine, other things you have running start adding variance to the measured performance, so you might have to run it multiple times and compare the best results. The original code runs very consistently at ~425±5 ms/token for me, while the modified version varies between 340 and 380 ms/token across runs.

@gjmulder added the enhancement (New feature or request) and performance (Speed related topics) labels on Mar 20, 2023
@blackhole89 (Collaborator)

I was independently trying to do something similar with the Q4_1 code here. I managed to squeeze out around 5% more performance by rearranging the SIMD math and avoiding a double load of the constant offsets, but saw no improvement from prefetching anything on my setup (a Skylake mobile Xeon, GCC 11).
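For a q4_1-style format, where each stored value is d*q + m, the block dot product can be rearranged so the float constants are applied once per block rather than per element. A scalar sketch of the idea (illustrative layout, not the actual patch):

```c
#include <stdint.h>

// Dot product of two 32-element q4_1-style blocks, where element k of x is
// dx*qx[k] + mx (and likewise for y). Expanding the product gives
//   dx*dy*sum(qx*qy) + dx*my*sum(qx) + mx*dy*sum(qy) + 32*mx*my,
// so only three integer reductions are needed, and each float constant is
// touched once per block. qx/qy hold the unpacked nibbles (0..15).
static float dot_q4_1_block(float dx, float mx, const uint8_t *qx,
                            float dy, float my, const uint8_t *qy) {
    int sxy = 0, sx = 0, sy = 0;
    for (int k = 0; k < 32; k++) {
        sxy += qx[k] * qy[k];
        sx  += qx[k];
        sy  += qy[k];
    }
    return dx*dy*(float)sxy + dx*my*(float)sx + mx*dy*(float)sy + 32.0f*mx*my;
}
```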

@TheSteveHan (Author) commented Mar 21, 2023

> but saw no improvements from prefetching anything on my setup (a Skylake mobile Xeon, GCC 11).

I took a quick look at the wiki for the Skylake mobile Xeon; it looks like the L3 cache size there (8 MB) is smaller than the cache (13 MB) on this i7-7700K desktop chip. The prefetch distance in this PR might be way too far for your chip? It's also trying to prefetch into L1 here; you might have better luck prefetching into L3 given the smaller cache size.
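For what it's worth, the hint argument controls how close the line gets pulled in:

```c
#include <xmmintrin.h>

// SSE prefetch hints: T0 pulls the line toward L1 (and all levels), T1
// toward L2, T2 toward L3, NTA is non-temporal (minimal cache pollution).
// A smaller last-level cache usually also wants a shorter prefetch distance.
static inline void prefetch_l1(const void *p) { _mm_prefetch((const char *)p, _MM_HINT_T0); } // what the PR does now
static inline void prefetch_l3(const void *p) { _mm_prefetch((const char *)p, _MM_HINT_T2); } // possibly better on smaller-cache chips
```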

@x02Sylvie commented Mar 25, 2023

Personally, I noticed an improvement on a 10700KF on Windows 10.

It went from 270 ms to 241 ms per token on 13B Alpaca, although the only part of this commit I applied was the main-loop modification, since there is no #include <sched.h> on Windows, and I assume the thread-affinity code would need adjustment to work on Windows as well.

The performance gains could probably be bigger on 30B and 65B models, as well as if I got the thread-affinity code working on Windows.
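A rough sketch of what a portable pinning helper might look like (illustrative only; I haven't verified the Windows path):

```c
#ifdef _WIN32
#include <windows.h>
#else
#define _GNU_SOURCE
#include <sched.h>
#endif

// Pin the calling thread to one logical CPU. Error handling and
// topology-aware core selection are omitted for brevity.
static void pin_current_thread(int cpu) {
#ifdef _WIN32
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << cpu);
#else
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
#endif
}
```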

@ggerganov (Owner)

Please reopen when and if this is ready to merge

@ggerganov closed this on Apr 13, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request on Dec 19, 2023: Bump mkdocs-material from 9.1.14 to 9.1.15