Performance Discrepancy: gpt4all Faster than Optimized llama.cpp #603

Closed · MillionthOdin16 opened this issue Mar 29, 2023 · 67 comments · Fixed by #775

Labels: performance (Speed related topics)

Comments

@MillionthOdin16

MillionthOdin16 commented Mar 29, 2023

Expected Behavior

I am comparing the performance of two executables: llama.cpp (current version) and the default gpt4all executable (which uses a previous version of llama.cpp). I am using the same language model for both executables, and I expect the current version of llama.cpp (which is built specifically for the hardware) to perform at least as fast as the default gpt4all executable.

Current Behavior

The default gpt4all executable, which uses a previous version of llama.cpp, performs significantly faster than the current version of llama.cpp. Despite building the current version of llama.cpp with hardware-specific compiler flags, it consistently performs significantly slower when using the same model as the default gpt4all executable.

Environment and Context

I am running the comparison on a Windows platform, using the default gpt4all executable and the current version of llama.cpp included in the gpt4all project. The version of llama.cpp is the latest available (after compatibility with the gpt4all model was added).

Steps to Reproduce

  1. Build the current version of llama.cpp with hardware-specific compiler flags.
  2. Execute the llama.cpp executable using the gpt4all language model and record the performance metrics.
  3. Execute the default gpt4all executable (previous version of llama.cpp) using the same language model and record the performance metrics.
  4. You'll see that the gpt4all executable generates output significantly faster for any number of threads or config.

Here's some context/config when I'm doing the runs:

[Screenshot: left panel is the latest llama.cpp, right panel is the gpt4all build]

This is the older version that gpt4all uses (with some tweaks): https://github.com/zanussbaum/gpt4all.cpp

*To quickly test the difference yourself you can use the gpt4all default binaries here: https://github.com/nomic-ai/gpt4all/tree/main/chat

@sw
Collaborator

sw commented Mar 29, 2023

The screenshot on the left has FMA and SSE3 enabled; is that the one that is faster? Try building both with the same flags.

@MillionthOdin16
Author

No, that's what I'm saying. The left is the optimized llama.cpp, the right is the unoptimized gpt4all. The unoptimized gpt4all is significantly faster than the optimized llama.cpp. So there's something wrong.

@slaren
Collaborator

slaren commented Mar 29, 2023

Without bisecting the exact commit that introduced a performance regression it is hard to do much about it. I suspect that it happened when the code for transposed matrix multiplication was replaced with a copy.

@MillionthOdin16
Author

First I'm just trying to check if someone else can reproduce the behavior. I def wanna use llama version because it has more features haha

@ggerganov
Owner

Double check you have built llama.cpp in Release

@x02Sylvie

Personally I've had a small performance regression (13B model) over the last 14 days, from 207 ms to 238 ms, but it's no biggie.

@xor2003

xor2003 commented Mar 29, 2023

On Windows, I see significant slowness when comparing the latest llama.cpp (30B) with the gpt4all LoRA.

@dakennedyd
Contributor

dakennedyd commented Mar 29, 2023

I also notice performance drops on x86-64 Linux; it also uses a lot more memory than before. I compiled the project following the instructions in the README.md.

@jessejohnson
Contributor

I am using the same language model for both executables

@MillionthOdin16 How did you do that? I couldn't get this to work yesterday. See this issue

@MillionthOdin16
Author

MillionthOdin16 commented Mar 29, 2023

I am using the same language model for both executables

@MillionthOdin16 How did you do that? I couldn't get this to work yesterday. See this issue

Did you try it in the last few hours? There were some commits a couple of hours ago that made it easy. Look at the readme section that was added for gpt4all.

It kinda drives me crazy that the one dude forked llama.cpp and then stopped maintaining it, because other repos keep forking his now-outdated repo ☹️

@alph4b3th

What is gpt4all, and what does it change relative to llama.cpp and alpaca.cpp? I parachuted into this conversation and don't really know the context, so I apologize.

@anzz1
Contributor

anzz1 commented Mar 30, 2023

We should definitely look into this, as this shouldn't be the case. I'm pretty confident, though, that enabling the optimizations didn't do it, since the perf was pretty well researched when we did that in #375. If performance got lost and memory usage went up somewhere along the way, we'll need to find out where that happened. If it doesn't run well, everyone is just going to fork from an older point in time instead.

@BrunoIsaac27

alpaca.cpp is pretty much a fork of an older llama.cpp version (which is apparently faster), but nothing much has really changed except a few default variables.

gpt4all is a fork of the alpaca.cpp fork, with modifications tailored specifically to the gpt4all model.

@ggerganov
Owner

ggerganov commented Mar 30, 2023

Someone on Windows should try and bisect to see where this alleged degradation happens - I don't have a Windows machine. I'm pretty sure there is no performance degradation on macOS and Linux. The only slowness introduced, as @slaren mentioned, was the removal of the transposed ggml_mul_mat path, which led to about 10% performance loss during single-token inference (i.e. after prompt ingestion). But this should have been compensated by the various updates in the SIMD code.

I guess the ggml_mul_mat change could have had a bigger impact on the prompt inference performance, but only if you are not linking to BLAS. In that case, the solution is to link to BLAS - you will gain significant speed-up during prompt ingestion.

But overall, we need some more specific info about your setup and timing numbers to be able to pinpoint the problem.

@anzz1
Contributor

anzz1 commented Mar 30, 2023

alpaca.cpp / gpt4all was forked specifically at this point: 9b4a15b
while at the time of writing, we are here: 9cbc404

There are exactly 180 commits between then and now, obviously too many to test manually.
Here is the full list of commits for reference:

Commit list
9cbc404ba6699a9ba4925ea25a60552b13491c7a <--- now
b51c717d5cf9181c33afcb84554e47f6d539c891
0ba76c1e73ae21038b80bfb5a746157376c88173
cea1c859483a5cfc7e2b31a06f8561d7a7604870
f202ada131f60059112a948f660b2e0ac93d049a
3b44d30d9b618f0f2eb9abcfe912770a4e7d85d4
61cbfff5c95e45236883b1b60e025f8f6fa8c8a3
d9ad104440d84a0cc0734bff47ef0ba41ba740c4
b467702b87461543c75013207e9adc6d20dcc01d
516d88e75c9e768c0001a452dbad212494c586b3
53635c081c49321d523567112f9fddfbba6b787b
41318d708ed196ff727dce14d263a64b23c7333d
a6956b25a1c783e5e96fe06c9c00438f846ef047
83df5639eb182ed7c122382907691d8baa3c32df
a5c42c4b13b3be9e58fe8f9adbb6ee60417674a6
5a5f8b1501fbb34367225544010ddfc306d6d2fe
f1217055eaedfc7214be93d98e529cae89830430
7f4c5c66514227c3870c2bd189fb0609fdd0de10
2a98bc18ea34dbf15f261a0df37080e588a189d1
d0aaff571cd5c316b68e3e11d57e274bfd2bd457
d0330fd783d7c67349cdcce4a56604ef0aeccdb5
99c5b2765422232ebb4414f5a63693d734406a7f
692ce3164ef1201ecb9cfad315cc0a08b965adb8
96f9c0506fa81cada6f96f45768c34f45406c4bb
d502bc7c9d9d6dfb3a09aea404395b666d7b374d
436e56193199a1625f8c561069f702e8840a9e08
20e1e84884376b3fb44ffbfd48d478b2934b0b5e
c1f885067c61191a07a1aedf684168dda62f3f71
e0670260fb50a882b37074112b1881fb0820cf77
28ba975aea1dcae2f31770516f5d542ff177771e
a6bdc47cba23713a22ade47dd65b6afeb8009ff4
7b8dbcb78b2f65c4676e41da215800d65846edd0
4b8efff0e3945090379aa2f897ff125c8f9cdbae
7e5395575a3360598f2565c73c8a2ec0c0abbdb8
34c1072e497eb92d81ee7c0e12aa6741496a41c6
939ad2d3a56815f480b6fd5ea432a7ee576a7e6b
8c2ec5e21d580c99e257c3cfddcf21fa53229aa4
b391579db92f095666be1d979899b54ae0981573
7a87d31f4f0c37bbb2ea695929fa4fe3ad579cda
348d6926ee31d4476f9b90e1a627b0925a70f847
33e35b8fe8f09adcac0632e9cece62e1dd629f7d
19726169b379bebc96189673a19b89ab1d307659
f732695cd57fb41e3a1be625cec4edf5be45b40a
2f7bf7dd7cd7299874d582f7f34834418abf4057
34ab5268432fd287caa68d60bdd8aef411def3fa
c2b25b6912662d2637d9c6e6df3a5de931e0d7ce
79b2b266db6b198b5af450982c3cd61120fac951
e2d490dafd860eaaaf9aa8008ab790527d556daf
03f7e335604b3d68f74995aa2ccb4955833ee423
55ad42af845127bd0eb0c1f36f327ecec83f4bca
459e93cce07cab9052c06b5bf360819893442e1e
a316a425d04027453dc0fd45f003b647c12f66f9
ecbe466a364876927994e2f1ec14f4d82301d201
502a400192013d3e95ed87b777e8fa3bec45713c
09aecbf6283bbce9449e2d96000073145aaaf5fc
4640eff23d341a0273587800e17ff4a378132d60
ab77d7631211b299cb734bea6ad1f74324154150
29b7baab670ae8b76ac0da21c2ded69ff18971ee
4a7129acd2e939b92d70dd568c746f2fa078232c
6b6dbc8910c6d53f4d96c46c8fcec70e2cd435d8
2a2e63ce0503d9bf3e55283e40a052c78c1cc3a8
e899bf54b291e8c84173a0e534a2c262f3f63229
fbd4d38c647f82b2598291ea9b8d0c09ac1ffb8c
58e6c9f36f97d0a3e287b97256dc5f6b0e9fb5ae
36d07532ef7ccf0bdc12e050472f359a6794957f
6f1ee4b640912211a4b07965c585d327e32e734d
8520fc310eab87f2c4612f2a00d4adbd44a20d0d
b3f460e94139cb24b0af81cc8bc10eb86269d704
04c6f5ed6fafd63601fa06757877ed5ccf9d5991
7a9b6c3a8bdc1cb75fefc826dfaa7331eb63695d
31572d966531f7d768eb773322016ab78eb6e835
f4f5362edb01b05c383b23f36d7b3489c77061b5
863f65e2e32dc1e6d23c96a4811bf382d6b2a548
afd220d9c665e4c19107120ace2f0cb742e28aa1
481044d50cfe8eaa6cd0c1a1b445680e4b0b3ebc
563cdc391dde140f1084d1012234e8e6f57f881f
8d4a855c241ecb0f3ddc03447fe56002ebf27a37
b6b268d4415fd3b3e53f22b6619b724d4928f713
3cd8dde0d1357b7f11bdd25c45d5bf5e97e284a0
4870e455b3653f7d7769fa5772b2c90ffad088df
483bab2e3d4a868fe679d8bb32827d2a4df214dc
404e1da38ec8025707031a8027da14dc1590f952
4cc053b6d5e9df7ac21fa06b7208a70c156d4d7a
0ba5a3a9a5efedb1aeecbbc70a4e9825542472d5
2e17dfd80a473099dacc0f41c9146d233c6a5972
20a1a4e09c522a80e2a0db51643d25fa38326065
ad072fc5ad6f6905a7224ff6ea07c0644aa075b1
ea10d3ded2994106596ddf8e4ed02741b3e053e6
a18c19259a3cb9dec332d613e8f15704f678a468
a50e39c6fe36be3de0941b3c05aaf9c37912fd47
a140219e81cfb80356438112cd2290d701b282bb
8a3e5ef801339e57b9b0449220e9ffb11a6648e2
8eea5ae0e5f31238a97c79ea9103c27647380e37
93208cfb929c2323e5d2ac6bf354e278040e70ed
03ace14cfd68a1289ac3b76563534c8ee72a2e53
e4412b45e395981068d2850d3fa04cc16c77d70d
f7dc43bc0d759732815856183246f167111587ad
ee8a7887865a893be208e0a92d6d94d2cb66a789
69c92298a9e36dc2363b3bf50452976ce49487b3
97940520e8fd49c56bb29b71cc350190b723513f
305ba6f0e6daa3796aad9dd18053a1945dd4cc58
4122dffff958cd137175b58f1f27c0913528d7ba
56e659a0b271436e24813a801640d015e7b05328
40ea807a972ec7b5a426f034ebfa593b5e7a06ed
d5850c53ca179b9674b98f35d359763416a3cc11
ae44e23ee36c02da0e37ab508a4b473ace724f8e
928480ef5b7b03d7a07e98286aebe3d8b24457d9
56817b1f882b1894daa4051d0de0bf9a0926d315
f5a77a629bd0f37ae1696747633ab42a5530ec15
da0e9fe90ccf6e73597eb19dd0cfc0a28363fb3b
e6c9e0986c79ba1cc8848879b2fcce979f9b4672
01a297b09932e29f3319d6588977c32a926c7907
3366853e41fcc818222a0271c76b6106179106fb
3f9c6135e45ae3f520b1e17197004cc60c9ca45b
0f6135270839f0715843c4d480c63ae150def419
353ec251a42491f5192c48561da4b444ef67f23c
89d5d90f3b6d25f134da7a8e252c3432bffcf674
16ffc013c62f22bdaa3cdc022d7a13fd952d73fc
486ae645fd3eda8b9d7413d5ff34fb65a3e337fb
3ab3e6582f7320c2b6568c892fdfc8215caf7e6c
f157088cb75f23208abc92b473a132ef3f7a7f15
c86ba036e613d46815501a4c6775117c9fc7afce
1daf4dd71235dbbf537738e7ad53daad8d97586f
dc6a845b8573cd7d06c6b295241d26f311602a1f
6a612959e1b6c37b68b6b141329751a2902b1030
d5f56a5e5a0069329a81f96460221e7afb1daddc
3bfa3b43b7319b71853bfc7d3cf4e9767c24bbc8
715d292ee0e34d27f27af43d7feaad1f1344981d
c98ae02668a25916954b1653e25a5a35ca048d63
c3b2306b18a087799acc431e485b8a2e3162cd52
975d2cebf97ce888fa0aeee6f5ac774d7135891f
e0ffc861fae5ac8b40ce973f822d03db02929d36
8f644a0a859938c787d329d27f98e03c58d7df27
eb34620aeceaf9d9df7fcb19acc17ad41b9f60f8
2e664f1ff413995506c9a54f3a8d5b8c64e37a91
8cf9f34eddc124d4ab28f4d2fe8e99d574510bde
bd4b46d6ba504b99c936f43fc014529adffb6048
6b6d5b5024faaf82019d08cde5e8a9d69c6ca316
a791a68b613b162c88a83f5f0225223bc167c762
0f1b21cb90ac6b84a9af70cafb8e13b5389e3b32
074bea2eb1f1349a0118239c4152914aecaa1be4
5cb63e2493c49bc2c3b9b355696e8dc26cdd0380
da5303c1ea68aa19db829c634f1e10d08d409680
4545539d718cf88f4c3a76669b8ac2e26cd8a1e5
edeba283665591f2f726024a92efe4b0b40434b3
5c19c70ba631a8f5d54feb6634e0eea178911a84
24568371ae0d7caf85164abe4753f36a7dba0288
7392f1cd2cef4dfed41f4db7c4160ab86c0dfcd9
ad5fd5b60cfdfbfb22b0f2bc9e9f6c9692768f8d
368d0c8a9ebae16a20e1c8971b21ee888bdefad5
50fae10d0339f2bd639f69dd679c0201d939a265
084e2f0ec081c929343d44b09df07ae87cd1ed32
0b366e735729327476ec31da02de3c9c9771ddfb
160bfb217da5038ccbd74438f9f16a16012d7866
c494ed5b94b429d3d73721235e78c9f5fa6e5652
c1c7026b470ced0b8a6c67e968c04bb47864def1
467b149761cd63248b00d6ffb204d50a4cbb451a
70f01cb8632f73b5cf70428608b89cd3c0775d23
a4e63b73dfa1894387926cc8072b5f36deebf0a5
9e1707218a24ff758c7b623594f8c0ce5e12eb6c
22213a17b56336bbea384a572a9484ce208c0333
d7def1a7524f712e5ebb7cd02bab0f13aa56a7f9
6f61c18ec9a30416e21ed5abfb1321bdb14979be
1e5a6d088d0f3a967c6e86298a756daec9e8df12
554b54152145c30618bac171efb712cf4a7d1e96
d3f202d57b694376cef6f381a6b6901825c3f6d9
e03e359730c127f888fcf00e93375771bc0a3500
a81d0c2a171a4446e6a21a3ec74a0c0768d71184
b2de7f18dfbb93463eeb5b4392117bbe82d5bd1b
a29274789309029fd88a9465e6d0832d4632272b
c9f670a17755311aa28c411f5c7f3c8c05434770
4f546091102a418ffdc6230f872ac56e5cedb835
e81b9c81c101f64531ef0fa1ee6b77d562635652
367946c668757532deed929e1d78673c6ac6bcb8
6b0df5ccf360fe5c015f6607f0375bfc6849005e
2af23d30434a677c6416812eea52ccc0af65119c
904d2a8d6acd667c9633138d45a361d40fbf76d0
721311070e31464ac12bef9a4444093eb3eaebf7
ac15de789547e5a6e93df552e787379b3a23ef26
273abc47ff9dd899b3c4f58acd19d4649e90d6b4
9b4a15b17d8395eb075379b140fcd0b0283f4ef6 <--- alpaca.cpp/gpt4all was forked here

I was thinking what we'd need is a script which: start -> create directory -> git clone and build commit id -> log the performance of some runs to a file -> remove directory -> loop to start

That would narrow it down to exactly where it happened.
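
Something along these lines, as a rough sketch (assuming a Unix-like box with git and make; the commit list, model path and ./main flags are placeholders to adjust):

#!/usr/bin/env python3
# Rough per-commit benchmark loop (sketch only). Reuses one clone and checks out
# each commit instead of re-cloning; assumes plain `make` builds every commit and
# that ./main accepts the usual -m/-p/-n/-t/-s flags.
import csv
import subprocess
import time

REPO = "https://github.com/ggerganov/llama.cpp"
COMMITS = ["9b4a15b", "9cbc404"]             # fill in the commits to test
MODEL = "/path/to/ggml-model-q4_0.bin"       # placeholder model path
PROMPT = "Building a website can be done in 10 simple steps:"

subprocess.run(["git", "clone", REPO, "bench-llama"], check=True)
with open("perf_log.csv", "w", newline="") as f:
    log = csv.writer(f)
    log.writerow(["commit", "wall_seconds"])
    for commit in COMMITS:
        subprocess.run(["git", "-C", "bench-llama", "checkout", "--force", commit], check=True)
        subprocess.run(["make", "-C", "bench-llama", "clean"], check=False)
        subprocess.run(["make", "-C", "bench-llama", "-j"], check=True)
        start = time.time()
        subprocess.run(["bench-llama/main", "-m", MODEL, "-p", PROMPT,
                        "-n", "128", "-t", "8", "-s", "42"], check=True)
        log.writerow([commit, round(time.time() - start, 2)])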

Unfortunately the GitHub Actions runners are very limited; otherwise a performance test run on every pull request would be easy to add to the test suite, nipping these issues in the bud before the situation devolves like this, where we have to go bug hunting through a list of 180 commits.

@sw
Copy link
Collaborator

sw commented Mar 30, 2023

I'm thinking it would be a good thing to measure performance more accurately.

Wall-clock time is not good enough, especially if there are other things happening on the system. You have to let it run for a long time to get a good enough average.

Another pitfall may be that a test suite causes downclocking of the processor. So the first test will get a cold processor running at full speed, and the later tests will have to run on a hot, slow processor.

Maybe getrusage(RUSAGE_THREAD, ...) would be useful here to get per-thread usage information, which could then be collected by the thread pool manager? Of course you could use RUSAGE_SELF to get the data for all threads combined, but maybe we want to see if all threads get used equally. We might also look into RDTSC and whatever the equivalent of getrusage is on Windows.
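
For what it's worth, the same getrusage call is exposed by Python's resource module (RUSAGE_THREAD is Linux-only), which makes it easy to experiment with the wall-clock vs. CPU-time distinction before instrumenting the C code itself; the workload below is just a stand-in:

import resource
import time

def busy_work(n):
    # Placeholder workload standing in for token generation.
    return sum(i * i for i in range(n))

wall_start = time.monotonic()
cpu_start = resource.getrusage(resource.RUSAGE_THREAD)  # Linux-only; use RUSAGE_SELF elsewhere

busy_work(5_000_000)

cpu_end = resource.getrusage(resource.RUSAGE_THREAD)
print(f"wall   : {time.monotonic() - wall_start:.3f} s")
print(f"user   : {cpu_end.ru_utime - cpu_start.ru_utime:.3f} s")
print(f"system : {cpu_end.ru_stime - cpu_start.ru_stime:.3f} s")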

@anzz1
Contributor

anzz1 commented Mar 30, 2023

@sw Very good points. "Preheating" a processor before a set of runs would ensure stability across the subsequent sets of runs. Then again, the first-run advantage quickly dissipates when doing a larger set of runs. Especially on desktop computers with good cooling this isn't much of a problem, unless you happened to just fire up your PC and instantly start a test. There is also the problem, especially on Windows (beginning with Windows 8 and onwards, each new iteration being worse than the one before), of the OS being so noisy that it greatly affects performance, so a large set of runs is required to achieve anything resembling an accurate average.

That would be more about perf testing in general, though; in this case the perf drop is significant enough to be spotted visually, so probably just a run or three would be enough to narrow down where it happened.

I think that is the most feasible way to go about this, since manually trying to figure out where the problem lies in the now much-changed codebase would be harder than just letting a script run the perf tests and tell you.

edit: the downclocking part you linked is exactly the thing I was trying to remember. I posted something earlier about AVX512 sometimes having worse performance for some workloads, especially on the earlier Intel processors that first introduced the instruction set, but I couldn't remember the exact cause; that was definitely it. Whatever improvement AVX512 brought to the table was offset by the downclocking, so the overall performance actually decreased.

@slaren
Collaborator

slaren commented Mar 30, 2023

I was thinking what we'd need is a script which: start -> create directory -> git clone and build commit id -> log the performance of some runs to a file -> remove directory -> loop to start

This can be done as a binary search too, the git bisect command should help with that.
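
For reference, git bisect run accepts any test command whose exit code marks a commit as good (0), bad (1) or untestable (125). A sketch of such a test script, with a made-up ms-per-token threshold and placeholder paths, could look like this:

#!/usr/bin/env python3
# Sketch of a `git bisect run` helper: exit 0 = good, 1 = bad, 125 = skip.
# Hypothetical usage:  git bisect start <bad> <good>  then  git bisect run python3 bisect_perf.py
import re
import subprocess
import sys

THRESHOLD_MS = 200.0                    # "acceptable" generation speed on this machine (made up)
MODEL = "/path/to/ggml-model-q4_0.bin"  # placeholder model path

if subprocess.run(["make", "-j"], capture_output=True).returncode != 0:
    sys.exit(125)                       # commit doesn't build: tell bisect to skip it

run = subprocess.run(["./main", "-m", MODEL, "-p", "Hello", "-n", "64", "-t", "8"],
                     capture_output=True, text=True)
output = run.stdout + run.stderr

# Older builds report "predict time ... ms per token", newer ones "eval time ... ms per run".
pairs = dict(re.findall(r"(prompt eval|predict|eval) time.*?([\d.]+) ms per (?:token|run)", output))
per_token = pairs.get("eval") or pairs.get("predict")
if per_token is None:
    sys.exit(125)                       # couldn't parse the timings: skip this commit
sys.exit(0 if float(per_token) < THRESHOLD_MS else 1)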

@anzz1
Contributor

anzz1 commented Mar 30, 2023

This can be done as a binary search too, the git bisect command should help with that.

Very interesting, I actually had no idea such a command existed. Seems useful in many cases. However, in this case, since there was also the format change somewhere in the middle, it'd simply be easier to go about it sequentially until you run into the commit where the format was changed, change the model only once, and proceed. Or run two binary searches, one before and one after the model change; that is also an option.

To be honest, once you've already gone to the trouble of setting the script up and have some compute time set aside for it, a sequential run would also give a log of the performance deltas across all the commits, showing which ones increased performance and which ones decreased it, as it might not be any single commit that's causing this but a pile-up of smaller decreases here and there. There have obviously been steps forward but also steps back in the bunch.

@KASR
Contributor

KASR commented Mar 30, 2023

I've written a small Python script to benchmark the token time as a function of the number of threads. I've attached the script if anyone wants to try it (--> benchmark_threads.txt, I had to change the extension in order to upload it). It could be useful for benchmarking performance across different versions. If not, just ignore this message :)

Below you can see the results from my PC. I'm using Windows 10; the typical system information looks like:

system_info: n_threads = xx/ 36 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

I didn't go all the way up to 36 threads; you can see the results below. The script runs each prompt 5 times and plots the average token times. For some reason the timings go up at around 18 threads (i.e. the number of physical cores). I will try it again later to see how robust the timings are.

Feel free to propose suggestions to make the benchmark more reliable.

[Plot: average token time vs. number of threads]

Edit: I've updated the plot so that it includes the eval and prompt eval times as well (don't mind the typos). It's really strange that the performance drops around 18 threads (i.e. the number of physical cores) and afterwards drops again...

[Plot v2: eval and prompt eval token times vs. number of threads]
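
For anyone who'd rather not download the attachment, the gist of such a sweep looks roughly like this (not the attached script; the model path, prompt and llama_print_timings parsing are assumptions):

#!/usr/bin/env python3
# Rough thread-count sweep (sketch): average the per-token eval and prompt-eval
# times reported by ./main for each -t value.
import re
import statistics
import subprocess

MODEL = "models/7B/ggml-model-q4_0.bin"   # placeholder path
PROMPT = "Building a website can be done in 10 simple steps:"
RUNS = 5

def per_token_ms(text, label):
    # label is "eval" or "prompt eval"; parses the llama_print_timings lines.
    pattern = rf"^llama_print_timings:\s+{label} time\s*=.*\(\s*([\d.]+) ms per (?:run|token)\)"
    m = re.search(pattern, text, re.MULTILINE)
    return float(m.group(1)) if m else None

for threads in (1, 2, 4, 8, 12, 16, 18, 24, 32, 36):
    evals, prompts = [], []
    for _ in range(RUNS):
        err = subprocess.run(["./main", "-m", MODEL, "-p", PROMPT, "-n", "64",
                              "-t", str(threads)], capture_output=True, text=True).stderr
        e, p = per_token_ms(err, "eval"), per_token_ms(err, "prompt eval")
        if e is not None: evals.append(e)
        if p is not None: prompts.append(p)
    if evals and prompts:
        print(f"t={threads:>2}  eval {statistics.mean(evals):7.2f} ms/token"
              f"  prompt {statistics.mean(prompts):7.2f} ms/token")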

@slaren
Collaborator

slaren commented Mar 30, 2023

@KASR at the moment it seems that you are only measuring the prompt eval time; I would recommend considering the prompt eval time and the regular (generation) eval time separately. You can safely assume that 1 run = 1 token in the eval time.

@anzz1
Contributor

anzz1 commented Mar 30, 2023

@KASR
That looks good. If you were to do a similar graph, but with commits on the x-axis instead of threads (x = 0 being 9b4a15b and x = 180 being 9cbc404), it would solve our problem right then and there of figuring out where the increases and decreases happened.

@MillionthOdin16
Author

Wow, thanks for the critical thinking, guys. You've mentioned some pretty interesting points. It's pretty crazy to see what affects performance, and looking at some of the discussions it seems like there are things like Intel performance cores that can have a significant impact (although not in my case 🙃).

I can definitely help with testing and performance metrics; I just need to make a script that'll produce reliable builds across versions for my environment. It's pretty picky and often needs tweaks to make the build succeed.

One of the differences/struggles right now is that the current llama.cpp gives much more performance-metric info than the build used in gpt4all, so it's hard to see the specific timings in the older gpt4all version. Apart from that, I'd want to make sure that the info I'm collecting while running builds for specific commits is actually the info that will help us.

But overall, we need some more specific info about your setup and timing numbers to be able to pinpoint the problem.

As for my build and build process, I have a Ryzen 3900X (12c/24t) and use CMake and Ninja to build my executables. I've also built with BLAS linked, but haven't seen a noticeable difference with the library vs. without. Other than that I use avx avx2 maxv maxv2 f16c sss3 on Release. I usually run with -t 8. And the models I use are the 4-bit quantized 7B.

I guess the ggml_mul_mat change could have had a bigger impact on the prompt inference performance, but only if you are not linking to BLAS. In that case, the solution is to link to BLAS - you will gain significant speed-up during prompt ingestion.

Where should I expect to see the performance increases when I'm running with BLAS? Is it during larger completions after the prompt is loaded?

I could also do things in WSL2, but I'm not sure about the performance impacts, which is why I currently don't use it. If you think it would be better let me know.

Again, awesome job guys! You're having a huge impact on making these models accessible to normal people 🔥🔥🔥

@anzz1
Contributor

anzz1 commented Mar 30, 2023

@MillionthOdin16 For Windows, the most common configuration (4 or 8 threads, AVX=yes, SSE3=yes, F16C=yes, AVX2=yes, AVX512=no, BLAS=no, WSL2=no) would definitely be the best to base the benchmarks on. Obviously, if you want to and have the time for it, more is always better.

The most important thing to know would be the performance data between commits, starting from 9b4a15b and ending at 9cbc404.

That is what will help in understanding where the decreases happened. Since there have been many commits with optimizations and performance increases, it makes no sense that gpt4all/alpaca/llama-9b4a15b is faster; it should be slower, because they don't have any of the recent optimizations. That leads to only one conclusion: there must have been significant decreases at some points in the timeline. It can be something not easily seen, like the compiler dropping inlining because of some change (inline and force-inline aren't the same, and the compiler can even drop force-inline), or a mistake somewhere; we can't really know. Only data can save us. 😄

@cyyynthia
Collaborator

I've been able to observe some performance degradation on Arch Linux as well. I haven't had time to look for the precise commit yet, but one potentially helpful observation is that the degradation seems to have happened after the ggml format migration, which may help simplify the exploration. It would be nice if someone else could confirm this and make sure it isn't something that happens only for me 😅

I'll keep doing some exploration, but here are the numbers I observed so far:

System info: Arch Linux - CPU: Intel Core i7 7700K. Compiled using make.
n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0
All of the runs were run with the same parameters, using the 7B model (Q4_0)

I only use 4 threads, as 8 threads tend to cause performance degradation for me.

ed3c680

llama_print_timings:        load time =  3013.13 ms
llama_print_timings:      sample time =   277.93 ms /   512 runs   (    0.54 ms per run)
llama_print_timings: prompt eval time =  1990.05 ms /    14 tokens (  142.15 ms per token)
llama_print_timings:        eval time = 131235.49 ms /   511 runs   (  256.82 ms per run)
llama_print_timings:       total time = 137390.49 ms

074bea2 (first commit using new format)

main:     load time =  1953.12 ms
main:   sample time =   248.32 ms
main:  predict time = 102235.23 ms / 194.73 ms per token
main:    total time = 105018.24 ms

9b4a15b

main:     load time =  1385.53 ms
main:   sample time =   247.39 ms
main:  predict time = 101518.52 ms / 193.00 ms per token
main:    total time = 103729.24 ms

@x02Sylvie

This might come in handy for tech-savvy lads here who need a slight performance boost: #295
You may need to modify the code a bit to make it work on the latest commits, however.

@anzz1
Contributor

anzz1 commented Mar 30, 2023

Thank you @cyyynthia , something is definitely up here.

Interestingly, the new format made load times go up 40%, but the sampling and predict times stayed the same (within the margin of error).

I've only now woken up to this, since you don't tend to notice marginal changes (in general, in anything) when you're always on the latest version, and I didn't notice the performance degrading gradually. But obviously everything is much slower now: loading, sampling, and prompt evaluation. This is a high-priority issue. Anyone trying out gpt4all/alpaca.cpp vs. current-gen llama.cpp will find it painfully obvious, while for someone just developing incrementally it has gone by unnoticed.

@cyyynthia
Collaborator

cyyynthia commented Mar 30, 2023

After a bit more digging, #439 seems to be a very clear culprit of performance drops: eval time goes from 175ms (better than before!) @ 404e1da to 244ms @ 483bab2.

It seems other timings do fluctuate in some intriguing ways, with increased loading times and sample times. I'll try to put together a test script and plot the evolution of these values over time on my machine.

@gjmulder gjmulder added the performance Speed related topics label Mar 30, 2023
@cyyynthia
Collaborator

I've done a first test and already gathered some interesting data. I ran my script on 18 commits from the range cited earlier, skipping 10 commits each time. Doing it on the full range will take a while, so I wanted to see what I could get without too much precision.

Here are the graphs I've gathered, and the raw CSV data if you want to explore it some more. I'll run it on all 180 commits later, probably tomorrow.

result.csv

[Graphs: execution time, eval token time, memory usage]

@sw
Collaborator

sw commented Apr 4, 2023

@cyyynthia : it's great that you're putting in all this work

Had a go at reverting #439 and #500, don't know if that would help you: sw@f747a43 (edit: some mistakes hopefully fixed: sw@f6d83dc)

It's faster than latest master at first glance, but I really haven't tested it thoroughly.

@cyyynthia
Collaborator

It's faster than latest master at first glance, but I really haven't tested it thoroughly.

@sw Oh yes it is! Only 333s runtime! 107ms per token on average, this is by far the fastest run of them all!! Here is the graph (with only interesting commits):
[Graph: per-token generation time for selected commits]

And the raw CSV data:
token_times.csv

@MillionthOdin16
Author

MillionthOdin16 commented Apr 4, 2023

@cyyynthia
Okay, wow. Can you summarize that graph so I can make sure I'm interpreting it correctly? The vertical axis is time in seconds and x is token count in the context? Is this a fixed test that measures ms per token, and the difference is that great?

relevant commits (for context tooltips)
437e778, 4870e45, 404e1da, sw/llama.cpp@f747a43

Also thanks to the awesome people who are diving into performance analysis! Really appreciate the effort by all of you <3

@cyyynthia
Collaborator

cyyynthia commented Apr 4, 2023

The vertical axis is in microseconds, and the horizontal axis is the number of tokens. I ran main with fixed parameters and let it generate 2000 tokens. For each token, I logged the time it took to generate it and plotted it on the graph.

Each line is a single test on a specific commit (single run). The amount of time it takes to generate a single token grows over time, as there are more and more things in the context store to care about. The yellow line is where #439 was introduced; we can see the time it takes to generate a single token grows much faster than before (the green line). The blue line is the current master branch, and we can see the regression is still very pronounced.

The red line is sw's patch, which is master with #439 (and #500) reverted. We can see the time it takes to generate a single token is much lower and grows much more slowly (exactly like it used to). We can also see there is an overall performance increase from all the SIMD optimizations that have been done over the past days.

I hope this explains it well enough!

edit: for reference, here are the total run times for all runs (from newest to oldest):
sw/llama.cpp@f747a43: 333s
437e778: 1448s
4870e45: 1673s
404e1da: 415s
d5850c5: 406s
da0e9fe: 411s
9b4a15b: 411s

@MillionthOdin16
Author

That's awesome! I always wonder what the actual performance increase is when I see commits with additional optimizations.

This plot is super useful for evaluating performance. I wonder if we can turn it into a script that runs as a workflow for each new commit or release, both as a performance sanity check and just because it's cool to see the progress :)

Just an idea for the future. But awesome job!!!

@ivanstepanovftw
Collaborator

I did not notice any performance improvement for sw@f747a43 =.=
Slightly worse perplexity and OpenBLAS broken.

sw@f747a43:

main: seed = 1680654638
llama_model_load: loading model from 'models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from 'models/7B/ggml-model-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks (335687 tokens, 512 n_ctx)
33.30 seconds per pass - ETA 6.06 hours
[1]4.5738,[2]5.1249,[3]5.9216,[4]6.6188,[5]6.7176,[6]6.7239,[7]6.9237,[8]7.0217,^C

Main:

28.09 seconds per pass - ETA 5.11 hours
[1]4.5619,[2]5.1787,[3]6.0491,[4]6.7494,[5]6.7852,[6]6.7668,[7]6.9845,

@cyyynthia
Collaborator

cyyynthia commented Apr 5, 2023

I did not notice any performance improvement for sw@f747a43 =.=

I'm highly unsure whether perplexity is a meaningful test for measuring performance, at least here. As I've illustrated in my graphs, at the beginning the token time is fine as long as only very few tokens are in the context; only as the context fills up does the degradation show up, and we get further and further away from the baseline. At 512 tokens, my data shows a 2x slowdown in token time, and at 2000 tokens a 7.5x slowdown.

I think the fact that OpenBLAS is broken is sort of expected, as the point of the branch was to see whether the same degradation of token time would occur if we reverted the offending commit but kept everything else (and according to my data the degradation completely vanishes), not to find a proper fix for the degradation yet.

edit: I'm even more convinced that perplexity and the ETA it shows are not a good test at all, as it seems the measurement is done on an empty context (see here), meaning it is totally blind to the core of this issue and to what my analysis points out.

@MillionthOdin16
Author

MillionthOdin16 commented Apr 5, 2023

I see a significant improvement for sw@f747a43. My experience on Windows matches cyyynthia's data. I didn't realize how bad it had gotten until I ran the reverted version and generation was so much faster. Thanks :)

@sw
Collaborator

sw commented Apr 5, 2023

Glad to see that this shows an improvement for some of you. I must admit I just quickly threw this together. I haven't looked at OpenBLAS at all so I apologize for that being broken. (edit: can't get OpenBLAS to link right away, but some mistakes are hopefully fixed here: sw@f6d83dc)

Reverting #439 is straightforward, but my second commit, which purports to revert #500, is a bit disingenuous as far as the commit description is concerned. It certainly deserves further scrutiny; for example, the int->int64_t changes from #626 are likely incomplete, which may break large tensors (as seen in #599).

Also, my intention is certainly not to have this slapped back onto master without further investigation. The intentions behind #439 and #500, namely reducing the complexity of the code base, were good after all.

@KASR
Contributor

KASR commented Apr 5, 2023

@ivanstepanovftw I agree with @cyyynthia: we first need to confirm the root cause of the performance degradation, and right now it does appear to be #439. On a side note, it has been discussed that perplexity is improved when using BLAS (see #406 for more info), so it's difficult to compare the results with/without BLAS.

I also agree with @sw --> ggml_cpy was introduced for a reason. To quote ggerganov: "Seems like the "transposed X" branch is more efficient, but not numerically stable. Will try to see if I can resolve it and if not, I will simply remove it all together from ggml." And indeed, before #439 the inference results differed when changing e.g. the number of threads, batch size, etc.

@sw Depending on the number of tokens, especially at n=2048, the improvements are very big (at least on Windows when building with CMake and VS; I can't say anything about other configurations), i.e. this has a significant impact during interactive mode.

@cyyynthia
Collaborator

this has a significant impact during interactive mode

I believe this is incorrect. It's not the -n that matters; it's how many things are in the context memory (i.e. -n_ctx and how far we are in the generation/interaction). On the revert branch, I've had significantly faster responses in interactive mode on the 13B model. I don't have data to back all this up, but I'm pretty sure the impact is the same in interactive mode.

@ggerganov
Owner

Thank you for this hard work - I missed this regression because I rarely run generations with more than a few tens of tokens. The problem is that the transpose operation for the V matrix is very slow and becomes slower and slower with every new token added.

I think I have provided a fix here: #775

Tested only on M1 so far

@sw
Collaborator

sw commented Apr 5, 2023

#775 was merged pretty quickly, @cyyynthia if it's not too much trouble, could you update your pretty graphs with that?

@ggerganov ggerganov reopened this Apr 5, 2023
@MillionthOdin16
Author

Also, @cyyynthia, is there a way we can adapt your process to analyze future builds?

Do you have a script or something that you were using that could help us create something like that?

People have done some pretty cool stuff and I want to make sure it isn't lost once the issue is resolved.

@cyyynthia
Collaborator

#775 was merged pretty quickly, cyyynthia if it's not too much trouble, could you update your pretty graphs with that?

@sw The graphs aren't super pretty this time because I didn't take the time to properly close everything and had a bunch of things open in the background while the test was running 😅

That being said, the regression appears to be gone. 🎉 Here are the graphs and the raw CSV:
[Graph: per-token generation times with #775 merged]
token_times.csv

Do you have a script or something that you were using that could help us create something like that?

@MillionthOdin16 As described earlier, my process was simply shoving a fprintf(stderr, "debug_token_time: %ld\n", ggml_time_us() - t_start_us); into the llama_eval_internal function and dumping the numbers it spat out into a CSV. Nothing too complex 😅

It could be added to llama.cpp itself behind a compile flag for enabling it more easily. But it's not rocket science, and I didn't use any special script for it.
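
For anyone wanting to reproduce the per-token graphs, a small wrapper along these lines (assuming main has been patched with the fprintf above; the model path, prompt and flags are placeholders) is all it takes:

#!/usr/bin/env python3
# Collect the per-token "debug_token_time" lines printed by a patched main into a CSV.
import csv
import re
import subprocess

cmd = ["./main", "-m", "models/7B/ggml-model-q4_0.bin",   # placeholder model path
       "-p", "Once upon a time", "-n", "2000", "-s", "42"]
stderr = subprocess.run(cmd, capture_output=True, text=True).stderr

times_us = [int(m.group(1)) for m in re.finditer(r"debug_token_time: (\d+)", stderr)]
with open("token_times.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["token_index", "eval_time_us"])
    writer.writerows(enumerate(times_us))
print(f"captured {len(times_us)} token timings")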

@MillionthOdin16
Author

Okay, thanks for summarizing the process.

That graph makes me so happy ❤️

@sw
Collaborator

sw commented Apr 5, 2023

Well I guess that settles it. The Great Wizard Georgi has saved the day! Thanks to @cyyynthia and @KASR for putting in the hard work of tracking this down.

I have opened #790 to track the discrepancy in the different partial times vs total time. I think this issue could be closed. Thanks everyone.
