Help test CPUSet patch for Windows and Linux #6927
Comments
Something seems to be going wrong between my device and master. As with another PR, this one increases speed significantly.
I left the new command line arguments at their defaults. Here are my arguments: master:
pr:
|
@Jeximo |
It's Qualcomm SDM855 Snapdragon 855 - Octa-core (1x2.84 GHz Kryo 485 & 3x2.41 GHz Kryo 485 & 4x1.78 GHz Kryo 485)
I tried
Here's a few runs:
It's Android, so core 0 is the biggest. |
Hello @mann1x |
Yes, support for Windows should be quite easy as the information is available on CPUSet.
It would be really nice to work together on this. The hash to stick the tensor affinity as well is exactly what I was thinking about. I did use the The issue with the existing pathway, if I understand it correctly, is that the affinity is set on the calling thread and, once the threads are spawned, it is reset with
Here each thread should be immediately set to the right affinity per core. What do you think?
Ehm no, probably a mistake... where do I have to look? |
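The per-thread approach described in this comment, where each worker sets its own affinity as soon as it starts instead of inheriting it from the calling thread, can be sketched roughly as follows. This is only an illustration in Python, not the PR's code: `pin_current_thread` and `run_pinned_workers` are hypothetical names, and the sketch assumes Linux, where `os.sched_setaffinity` with pid 0 applies to the calling thread.

```python
import os
import threading


def pin_current_thread(core):
    """Pin the calling thread to `core` and return its resulting affinity set.

    Returns None on platforms without os.sched_setaffinity (e.g. Windows, macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Assumption: on Linux, pid 0 targets the calling thread, not the whole process.
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)


def run_pinned_workers(cores):
    """Start one worker per core; each worker pins itself before doing any work."""
    results = {}

    def worker(core):
        # First action inside the thread: set its own affinity.
        results[core] = pin_current_thread(core)
        # ... the actual compute work would go here ...

    threads = [threading.Thread(target=worker, args=(c,)) for c in cores]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))[:2]
        print(run_pinned_workers(allowed))
```

Because the pinning happens inside each worker, the affinity of the thread that spawned the workers is left alone.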
This is interesting. Here are my results: PR
master + OpenBLAS
Master is using OpenBLAS though, should I disable it?
PR is selecting only 3/8 cores, but I have 4/8 (physical) cores. |
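For anyone who wants to cross-check how many physical cores Linux reports, here is a minimal sketch that keeps one logical CPU per physical core using the kernel's `thread_siblings_list` sysfs files. It is not how the PR selects cores (that is not shown in this thread); the function names are hypothetical, and the sketch does not distinguish big from little cores or P-Cores from E-Cores.

```python
import glob
import re


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-3,8' into a list of integers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def first_siblings(siblings_by_cpu):
    """Keep one logical CPU per physical core: the lowest-numbered sibling."""
    return sorted(
        cpu
        for cpu, text in siblings_by_cpu.items()
        if cpu == min(parse_cpu_list(text))
    )


def read_siblings_from_sysfs():
    """Read {cpu index: sibling list text} from sysfs (Linux only)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    result = {}
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            result[cpu] = f.read()
    return result


if __name__ == "__main__":
    print(first_siblings(read_siblings_from_sysfs()))
```

On a 4-core, 8-thread machine this should list four CPU indices, which can then be compared against what the PR selects with and without `-acz 1`.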
@arch-btw |
@mann1x ok that helped! PR + -acz 1
|
@arch-btw |
Great job, it selects it automatically now:
|
If there's someone using an Intel CPU with E-Cores, please test that the E-Cores are being excluded from the selection under Windows and Linux. |
Btw, don't test performance with sanitizers enabled: |
Thanks for clarifying. On the other hand, #6832 is no longer building for me:
build error log
(master)> gh pr checkout #6832
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 55 (delta 34), reused 34 (delta 24), pack-reused 8
Unpacking objects: 100% (55/55), 166.26 KiB | 803.00 KiB/s, done.
* [new ref] refs/pull/6832/head -> mannix-win32-cpuset
Switched to branch 'mannix-win32-cpuset'
~/llama2 (mannix-win32-cpuset) cmake -B build -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod+i8mm && cd build && cmake --build . --config Release --target server --target main
-- The C compiler identification is Clang 18.1.4
make[3]: *** [common/CMakeFiles/common.dir/build.make:76: common/CMakeFiles/common.dir/common.cpp.o] Error 1 |
I have no idea why... @ggerganov |
It looks like |
Feature Description
Adding CPUSet and thus a better core selection and usage for llama.cpp
Works on Windows and Linux x64 up to 64 logical cores.
Motivation
Faster, about 10%, and more efficient inference.
Keep the system responsive while using llama.cpp.
Possible Implementation
#6832
Problems addressed:
These command line options have been added:
-bco: Best Core Order, set to 1 will invert the default order and the cores will be selected from the best to the worst
-llct: Last Level Cache Traversal, set to 1 will allow the core selection to traverse the Last Level cache index
-acz: Allow Core Zero, set to 1 will allow selection of Core 0
-atc: Allow Threaded Cores, set to 1 will allow selection of threaded, non physical cores
-ccm: Custom Cpu Mask, allow setting a custom cpu affinity bitmask as integer

Please test if the default settings are followed or not and if the options are behaving as expected.
In particular test if the E-Cores on Intel are correctly detected and disabled.
Make a comparison of the speed in prompt and eval against the master branch and report the results.
Thanks
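For testers trying the `-ccm` option, it may help to see how an integer affinity bitmask relates to core indices. The sketch below is only an illustration: it assumes bit i of the mask stands for logical core i (least significant bit = core 0), which this issue does not spell out, and the helper names are hypothetical. The 64-bit bound mirrors the "up to 64 logical cores" limit stated above.

```python
MAX_LOGICAL_CORES = 64  # limit stated in the feature description


def mask_to_cores(mask):
    """Return the logical core indices whose bits are set in `mask`.

    Assumption: bit i corresponds to logical core i.
    """
    if mask < 0 or mask >= 1 << MAX_LOGICAL_CORES:
        raise ValueError("mask must fit in 64 bits")
    return [i for i in range(MAX_LOGICAL_CORES) if (mask >> i) & 1]


def cores_to_mask(cores):
    """Build an integer bitmask with one bit set per requested core."""
    mask = 0
    for core in cores:
        if not 0 <= core < MAX_LOGICAL_CORES:
            raise ValueError("core index out of range: %r" % (core,))
        mask |= 1 << core
    return mask


if __name__ == "__main__":
    mask = cores_to_mask([1, 2, 3])
    print(mask, mask_to_cores(mask))  # prints: 14 [1, 2, 3]
```

Under that assumption, a mask of 14 would select cores 1 to 3 and leave core 0 free, which is one way to check whether a custom mask is being honored.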