
[WIP] Improve performance on x86 #295

Closed
TheSteveHan wants to merge 1 commit

Conversation

TheSteveHan

Could someone please take over this pull request?

Unfortunately, I'm quite behind on a few other obligations, so I won't be able to continue exploring here. Feel free to take this as inspiration and make a production-ready version!


I did some initial exploration of various ways to squeeze more performance out of the main loop on my Ubuntu desktop with an i7-7700K CPU.

The code was compiled with gcc-10 and invoked with
./main -m ./models/7B/ggml-model-q4_0.bin -s 1679164839 -m ./models/7B/ggml-model-q4_0.bin -n 1280

Since inference is usually memory-bound, I specifically looked for ways to improve memory access.
It seems a combination of prefetching + CPU pinning + loop unrolling can improve performance by up to ~25%.
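Roughly, the pattern looks like this (a sketch using a hypothetical q4_0-style block layout, not ggml's actual structs; the distance of 32 blocks is the arbitrary value from this PR):

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical q4_0-style block: one fp32 scale plus 32 weights packed as
// 4-bit nibbles. ggml's real struct differs; this is illustration only.
typedef struct {
    float   d;
    uint8_t qs[16];
} block_q4;

// Prefetch a fixed distance ahead and process two blocks per iteration.
// nb is assumed even; prefetching past the end of the array is harmless.
static float dot_q4(const block_q4 *x, const block_q4 *y, int nb) {
    float sum = 0.0f;
    for (int i = 0; i < nb; i += 2) {
        _mm_prefetch((const char *)(x + i + 32), _MM_HINT_T0);
        _mm_prefetch((const char *)(y + i + 32), _MM_HINT_T0);
        for (int u = 0; u < 2; u++) { // manually unrolled pair
            const block_q4 *bx = &x[i + u];
            const block_q4 *by = &y[i + u];
            int isum = 0;
            for (int k = 0; k < 16; k++) {
                const int x0 = (bx->qs[k] & 0x0F) - 8;
                const int x1 = (bx->qs[k] >> 4)   - 8;
                const int y0 = (by->qs[k] & 0x0F) - 8;
                const int y1 = (by->qs[k] >> 4)   - 8;
                isum += x0*y0 + x1*y1;
            }
            sum += bx->d * by->d * (float)isum;
        }
    }
    return sum;
}
```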

[Screenshot: performance measurements, 2023-03-18]

These changes have only been tested on my machine, and I suspect the code won't even compile on other platforms.

@sw (Collaborator) commented Mar 19, 2023

This worked with gcc 11.3 but gave only a slight improvement. I've been looking at that function as well since it's clearly the hotspot. I was playing around with the AVX2 instructions but it seems to be pretty much memory-bound.

I tried using _mm256_maddubs_epi16 as described here, but didn't see a consistent improvement.
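For reference, the usual pattern with that instruction, since it multiplies unsigned bytes by signed bytes (a sketch of the standard trick, not my exact code):

```c
#include <immintrin.h>

// _mm256_maddubs_epi16 computes a*b for UNSIGNED bytes a and SIGNED bytes b,
// summing adjacent pairs into int16. For two signed int8 vectors x and y,
// move x's sign onto y first with _mm256_sign_epi8. No saturation risk for
// nibble-range values in [-8, 7]; full-range int8 can saturate the pair sums.
static inline __m256i mul_sum_i8_pairs(__m256i x, __m256i y) {
    const __m256i ax  = _mm256_sign_epi8(x, x);          // |x|, as unsigned
    const __m256i sy  = _mm256_sign_epi8(y, x);          // y * sign(x)
    const __m256i dot = _mm256_maddubs_epi16(ax, sy);    // 16 x int16 pair sums
    return _mm256_madd_epi16(dot, _mm256_set1_epi16(1)); // widen to 8 x int32
}
```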

@TheSteveHan (Author) commented Mar 19, 2023

> This worked with gcc 11.3 but gave only a slight improvement. I've been looking at that function as well since it's clearly the hotspot. I was playing around with the AVX2 instructions but it seems to be pretty much memory-bound.
>
> I tried using _mm256_maddubs_epi16 as described here, but didn't see a consistent improvement.

The hard-coded 32 in the prefetch distance is quite arbitrary; I wonder if a different value would work better on your machine.
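Something like this would make it easy to sweep (a hypothetical knob, not in the PR):

```c
// Hypothetical tuning macro; 32 was tuned only on an i7-7700K.
// Used as: _mm_prefetch((const char *)(x + i + PREFETCH_DIST), _MM_HINT_T0);
#ifndef PREFETCH_DIST
#define PREFETCH_DIST 32 // in blocks; worth sweeping e.g. 8..128 per machine
#endif
```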

As you approach the limits of your machine, other things you have running start adding variance to the measured performance, so you might have to run it multiple times and compare the best results. The original code runs very consistently at ~425±5 ms/token for me, while the modified version varies between 340 and 380 ms/token across runs.

@gjmulder added the enhancement (New feature or request) and performance (Speed related topics) labels on Mar 20, 2023
@blackhole89 (Collaborator)

I was independently trying to do something similar with the Q4_1 code here. I managed to squeeze out around 5% more performance by rearranging the SIMD math and avoiding a double load of the constant offsets, but saw no improvement from prefetching anything on my setup (a Skylake mobile Xeon, GCC 11).
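For a q4_1-style format, where each stored value is d*q + m, the block dot product can be rearranged so the float constants are applied once per block rather than per element. A scalar sketch of the idea (illustrative layout, not the actual patch):

```c
#include <stdint.h>

// Dot product of two 32-element q4_1-style blocks, where element k of x is
// dx*qx[k] + mx (and likewise for y). Expanding the product gives
//   dx*dy*sum(qx*qy) + dx*my*sum(qx) + mx*dy*sum(qy) + 32*mx*my,
// so only three integer reductions are needed, and each float constant is
// touched once per block. qx/qy hold the unpacked nibbles (0..15).
static float dot_q4_1_block(float dx, float mx, const uint8_t *qx,
                            float dy, float my, const uint8_t *qy) {
    int sxy = 0, sx = 0, sy = 0;
    for (int k = 0; k < 32; k++) {
        sxy += qx[k] * qy[k];
        sx  += qx[k];
        sy  += qy[k];
    }
    return dx*dy*(float)sxy + dx*my*(float)sx + mx*dy*(float)sy + 32.0f*mx*my;
}
```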

@TheSteveHan (Author) commented Mar 21, 2023

> but saw no improvements from prefetching anything on my setup (a Skylake mobile Xeon, GCC 11).

I took a quick look at the wiki for the Skylake mobile Xeon; it looks like the L3 cache size there (8 MB) is smaller than the cache (13 MB) on this i7-7700K desktop chip. The prefetch distance in this PR might be way too far for your chip? It's also trying to prefetch into L1 here; you might have better luck prefetching into L3 given the smaller cache size.
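For what it's worth, the hint argument controls how close the line gets pulled in:

```c
#include <xmmintrin.h>

// SSE prefetch hints: T0 pulls the line toward L1 (and all levels), T1
// toward L2, T2 toward L3, NTA is non-temporal (minimal cache pollution).
// A smaller last-level cache usually also wants a shorter prefetch distance.
static inline void prefetch_l1(const void *p) { _mm_prefetch((const char *)p, _MM_HINT_T0); } // what the PR does now
static inline void prefetch_l3(const void *p) { _mm_prefetch((const char *)p, _MM_HINT_T2); } // possibly better on smaller-cache chips
```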

@x02Sylvie commented Mar 25, 2023

Personally, I noticed an improvement on a 10700KF on Windows 10.

It went from 270 ms to 241 ms per token on 13B Alpaca, although the only part of this commit I applied was the main-loop modification, since there is no #include <sched.h> on Windows, and I assume the thread-affinity code would need adjustment to work on Windows as well.

The performance gains could probably be bigger on 30B and 65B models, as well as if I got the thread-affinity code working on Windows.
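A rough sketch of what a portable pinning helper might look like (illustrative only; I haven't verified the Windows path):

```c
#ifdef _WIN32
#include <windows.h>
#else
#define _GNU_SOURCE
#include <sched.h>
#endif

// Pin the calling thread to one logical CPU. Error handling and
// topology-aware core selection are omitted for brevity.
static void pin_current_thread(int cpu) {
#ifdef _WIN32
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << cpu);
#else
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
#endif
}
```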

@ggerganov (Owner)

Please reopen when and if this is ready to merge

@ggerganov closed this on Apr 13, 2023
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request on Dec 19, 2023: Bump mkdocs-material from 9.1.14 to 9.1.15