-
Notifications
You must be signed in to change notification settings - Fork 9.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(perf/UX): Use num physical cores by default, warn about E/P cores #934
fix(perf/UX): Use num physical cores by default, warn about E/P cores #934
Conversation
I originally wrote the code parsing As for code style, this should probably be moved out of |
This hasn't been tested on darwin or windows. I would appreciate CI or someone being able to test.
For Linux, I get the number of physical cores, rather than the logical cores provided by So previously, it used 16/16 hyper-threaded threads. Now it uses 8/16.
I guess you mean here. I saw that too. Should we change all the logic to be in the header? Line 18 in e7f6997
|
This is better kept in common.cpp. Maybe initialize the field to 0 or -1. Then move your code for determining the default into its own function and call that from |
These pertain to the struct default, so, I suggest adding a function in the header, with impl in .cpp that contains this logic. Function name is |
Btw, |
just as an fyi: i did some benchmark test for another issue ( see #603 (comment) ) this was done on a Xeon W 2295 having 18 physical cores. However, at those benchmarks the performance was best either a bit below the number of physical cores or a bit above it. Also performance did increase when using more threads then the number of physical cores. So the settings will probably be system depended. Perhaps it's an idea to include a benchmark script so that users can test the on their own system and determine the performance as function of system settings? |
The best performance is not the aim. It's to provide a reasonable default. 2X slower than optimal is not reasonable (as was the result of running on all logical cores with hyperthreading on), but 10% slower is still reasonable. My guess is that num physical cores gets within 10% of optimal. A bench script to loop through different n_threads and report results would definitely be a nice orthogonal improvement, and we could include it in the main README so some users actually get to using it. |
Hmm I see. There was a separate discussion about this type of Arch. Let me replicate the conclusions here. (I don't think there is a super nice solution in this case). I also wonder if the same thing applies to the new Intel CPUs with E and P cores. Are you able to report some rough results of running 8 v.s. 4 cores in terms of the ms per token for inference mode? |
7B 4 threads => 80ms/token 13B 4 threads => 167ms/token So it's roughly 2x slower when using 8 cores (lo+hi) instead of 4 cores (hi only). |
Btw, this is the output of
So we can detect the number of performance cores via |
I've uploaded the python script that i use as a gist --> benchmark_threads_llama_cpp.py, feel free to include it in your pr if you want to |
Anyone have a non-M1/M2 mac? What is the result of grepping perflevel? |
I confirmed that Suggestion for the code: int result = sysctlbyname("hw.perflevel0.physicalcpu", &num_physical_cores, &len, NULL, 0);
if (result == 0) {
params.n_threads = num_physical_cores;
} else {
result = sysctlbyname("hw.physicalcpu", &num_physical_cores, &len, NULL, 0);
if (result == 0) {
params.n_threads = num_physical_cores;
}
} |
I think we want to modify this by assuming a single global optimum and doing halving so we only need log n_cpu steps rather than n_cpu steps. We should start with the default and then go up by a quarter step. There should also be a short warm up step of -n 16 or something. |
I'll suggest that after all the checks, we clamp the default number of threads to maximum of 8 because there is almost never a reason to go beyond that I think |
Not for @KASR 's case though #603 (comment) I think we should let the benchmark script speak for itself. If we cannot get the physical cores, we will use |
…jon/use-hardware-cores
@MillionthOdin16 would you be able to check it this works on windows for you? (does it show num_physical_cores as the default when running lamma.cpp) |
The output belongs to stderr, not stdout (use cerr). Also the red color is not needed imho and can cause some trouble on non-standard terminals (windows). |
Why clip default to 4? |
examples/common.cpp
Outdated
if (line.find("cpu cores") != std::string::npos) { | ||
line.erase(0, line.find(": ") + 2); | ||
try { | ||
return (int32_t) std::stoul(line); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there .erase
applied to whole string? Why no checking find
for result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean. What do you suggest?
Refer to discussion above. |
examples/common.cpp
Outdated
#elif defined(_WIN32) | ||
SYSTEM_INFO sysinfo; | ||
GetNativeSystemInfo(&sysinfo); | ||
return (in32_t) sysinfo.dwNumberOfProcessors; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this typo and the problems with the extern
s above (see CI checks), this won't build on Windows, so I assume it hasn't been tested at all on Windows?
Maybe better to use something check cpu type before. |
I just wanted to jump in to say that on an AMD EPYC 7551P 32-Core Processor with 64 threads, I still get the best performance running with 32 threads. I don't mind the default being capped at 4 or 8 threads since I'm already in the habit of using If the default number of cores is capped at 4 or 8, it may be helpful to have an option that sets the thread count equal to the number of cores. This would simplify online advice about optimizing performance by eliminating the need to explain the difference between physical cores and threads or processors in a system and give those less technically savvy a different option to try rather than taking a blind guess at the ideal number of threads. Perhaps when the user gives 0 or -1. ie |
Thinking about this more ... Since we can reasonably well detect number of physical cores on Linux and macOS, I don't think we should be clamping the number of cores to 4. For Windows, we can reliably detect number of physical cores with The code produced by GPT (totally untested): DWORD buffer_size = 0;
DWORD result = GetLogicalProcessorInformation(NULL, &buffer_size);
// assert result == FALSE && GetLastError() == ERROR_INSUFFICIENT_BUFFER
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(buffer_size);
result = GetLogicalProcessorInformation(buffer, &buffer_size);
if (result != FALSE) {
int num_physical_cores = 0;
DWORD_PTR byte_offset = 0;
while (byte_offset < buffer_size) {
if (buffer->Relationship == RelationProcessorCore) {
num_physical_cores++;
}
byte_offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
buffer++;
}
std::cout << "Number of physical cores: " << num_physical_cores << std::endl;
} else {
std::cerr << "Error getting logical processor information: " << GetLastError() << std::endl;
}
free(buffer); |
Hmm, I'm still worried about the E/P core edge case, but perhaps a warning for this will suffice. As for the Windows one, I tried to install a virtual machine, but I'm unfamiliar/uninterested in enough in setting up my dev environment on windows to continue investigating; thus I will use a naive default of 4 for now, issue a warning for lack of calibration on windows, and someone who is interested and has access to a windows machine can impl the |
…jon/use-hardware-cores
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
…jon/use-hardware-cores
Thanks @DannyDaemonic , the PR is looking much better now. Hope someone could take a look and give the final sign-off if it is ready. |
I hope it gets in, the threading issue has really been hanging in limbo for a while now. I'm curious if they'll let this in without a full Windows implementation. If not, I can can provide one in the next day or two. |
Fixes: #932
Hyperthreading is bad, probably because we are compute bound (not memory bound).
See also: #34
Notes: I consulted GPT4 in the making of this PR.