
fix(perf/UX): Use num physical cores by default, warn about E/P cores #934

Merged: 19 commits, Apr 30, 2023

Conversation

@jon-chuang (Contributor) commented Apr 13, 2023

Fixes: #932

Hyperthreading is bad, probably because we are compute bound (not memory bound).

See also: #34

Note: I consulted GPT-4 in the making of this PR.

@sw (Contributor) commented Apr 13, 2023

I originally wrote the code parsing /proc/cpuinfo without having access to a wide variety of machines. It's good that you make the effort to improve this. How do the various methods compare to simply using std::thread::hardware_concurrency?

As for code style, this should probably be moved out of gpt_params_parse, as it's not really about parsing the commandline. There's some more logic in the header file for n_threads, it would be nice to have this in one place.

@jon-chuang changed the title from "fix(params): Use hardware cores by default" to "fix(params): Use num hardware cores by default" on Apr 13, 2023
@jon-chuang (Contributor, Author) commented Apr 13, 2023

without having access to a wide variety of machines.

This hasn't been tested on Darwin or Windows. I would appreciate CI, or someone being able to test it.

How do the various methods compare to simply using std::thread::hardware_concurrency?

For Linux, I get the number of physical cores, rather than the logical cores provided by std::thread::hardware_concurrency. Perf (ms per token) is 1.5-2x better, which makes for a much better default.

So previously, it used 16/16 hyper-threaded threads. Now it uses 8/16.

There's some more logic in the header file for n_threads, it would be nice to have this in one place.

I guess you mean here. I saw that too. Should we change all the logic to be in the header?

int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());

@sw (Contributor) commented Apr 13, 2023

Should we change all the logic to be in the header?

This is better kept in common.cpp. Maybe initialize the field to 0 or -1. Then move your code for determining the default into its own function and call that from gpt_params_parse and gpt_print_usage? But that will break if a program uses common.h but doesn't call these functions.

@jon-chuang (Contributor, Author):

This is better kept in common.cpp.

These pertain to the struct default, so I suggest adding a function declared in the header, with the implementation containing this logic in the .cpp.

Function name is get_default_physical_cpu_cores()

@jon-chuang changed the title from "fix(params): Use num hardware cores by default" to "fix(params): Use num physical cores by default" on Apr 13, 2023
@prusnak (Collaborator) commented Apr 13, 2023

Btw, sysctl hw.physicalcpu returns 8 on M1, but you want to use only 4 threads, because M1 contains 4 high-performance cores and 4 low-performance cores.

@KASR (Contributor) commented Apr 13, 2023

Just as an FYI: I did some benchmark tests for another issue (see #603 (comment)).

This was done on a Xeon W-2295 with 18 physical cores. In those benchmarks, performance was best either a bit below or a bit above the number of physical cores. Performance also increased in some cases when using more threads than the number of physical cores, so the best setting will probably be system dependent.

Perhaps it's an idea to include a benchmark script so that users can test on their own system and determine the performance as a function of the settings?

@jon-chuang (Contributor, Author) commented Apr 13, 2023

the performance was best either a bit below the number of physical cores or a bit above it.

The best performance is not the aim.

It's to provide a reasonable default. 2X slower than optimal is not reasonable (as was the result of running on all logical cores with hyperthreading on), but 10% slower is still reasonable. My guess is that num physical cores gets within 10% of optimal.

A bench script to loop through different n_threads and report results would definitely be a nice orthogonal improvement, and we could include it in the main README so some users actually get to using it.

@jon-chuang (Contributor, Author) commented Apr 13, 2023

Btw, sysctl hw.physicalcpu returns 8 on M1, but you want to use only 4 threads, because M1 contains 4 high-performance cores and 4 low-performance cores.

Hmm, I see. There was a separate discussion about this type of architecture. Let me replicate the conclusions here. (I don't think there is a super nice solution in this case.)

I also wonder if the same thing applies to the new Intel CPUs with E and P cores.

Are you able to report some rough results of running 8 vs. 4 cores in terms of the ms per token for inference mode?

@prusnak (Collaborator) commented Apr 13, 2023

Are you able to report some rough results of running 8 vs. 4 cores in terms of the ms per token for inference mode?

7B 4 threads => 80ms/token
7B 8 threads => 167ms/token

13B 4 threads => 167ms/token
13B 8 threads => 320ms/token

So it's roughly 2x slower when using 8 cores (lo+hi) instead of 4 cores (hi only).

@prusnak (Collaborator) commented Apr 13, 2023

Btw, this is the output of sysctl -a | grep hw.perflevel on my M1:

hw.perflevel0.physicalcpu: 4
hw.perflevel0.physicalcpu_max: 4
hw.perflevel0.logicalcpu: 4
hw.perflevel0.logicalcpu_max: 4
hw.perflevel0.l1icachesize: 196608
hw.perflevel0.l1dcachesize: 131072
hw.perflevel0.l2cachesize: 12582912
hw.perflevel0.cpusperl2: 4
hw.perflevel0.name: Performance
hw.perflevel1.physicalcpu: 4
hw.perflevel1.physicalcpu_max: 4
hw.perflevel1.logicalcpu: 4
hw.perflevel1.logicalcpu_max: 4
hw.perflevel1.l1icachesize: 131072
hw.perflevel1.l1dcachesize: 65536
hw.perflevel1.l2cachesize: 4194304
hw.perflevel1.cpusperl2: 4
hw.perflevel1.name: Efficiency

So we can detect the number of performance cores via hw.perflevel0.physicalcpu.

@KASR (Contributor) commented Apr 13, 2023

A bench script to loop through different n_threads and report results would definitely be a nice orthogonal improvement, and we could include it in the main README so some users actually get to using it.

I've uploaded the Python script that I use as a gist --> benchmark_threads_llama_cpp.py; feel free to include it in your PR if you want to.

@jon-chuang (Contributor, Author) commented Apr 13, 2023

Btw, this is the output of sysctl -a | grep hw.perflevel on my M1:

Anyone have a non-M1/M2 mac? What is the result of grepping perflevel?

@prusnak (Collaborator) commented Apr 13, 2023

Anyone have a non-M1 mac? What is the result of grepping perflevel?

I confirmed that hw.perflevel0.physicalcpu exists on Intel iMac and Intel MacBook too. So we can use that first. If the value is not available we can fall back to hw.physicalcpu.

Suggestion for the code:

    // needs: #include <sys/sysctl.h>
    int32_t num_physical_cores;
    size_t len = sizeof(num_physical_cores);
    int result = sysctlbyname("hw.perflevel0.physicalcpu", &num_physical_cores, &len, NULL, 0);
    if (result == 0) {
        params.n_threads = num_physical_cores;
    } else {
        result = sysctlbyname("hw.physicalcpu", &num_physical_cores, &len, NULL, 0);
        if (result == 0) {
            params.n_threads = num_physical_cores;
        }
    }

@jon-chuang (Contributor, Author) commented Apr 13, 2023

I've uploaded the python script that i use as a gist

I think we want to modify this by assuming a single global optimum and doing a halving search, so we only need about log(n_cpu) steps rather than n_cpu steps.

We should start with the default and then go up by a quarter step. There should also be a short warm-up step of -n 16 or something.
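
For illustration, here is a minimal C++ sketch of that search idea (my own sketch, not code from the PR or from the gist). It assumes ms-per-token is unimodal in the thread count, and benchmark_ms_per_token is a hypothetical callback that would run a short warm-up plus a fixed generation (e.g. -n 16) and return ms/token:

#include <functional>

// Sketch only: ternary search over thread counts, assuming ms-per-token is
// unimodal in n_threads. benchmark_ms_per_token is a hypothetical callback.
static int find_best_n_threads(int lo, int hi,
                               const std::function<double(int)> & benchmark_ms_per_token) {
    while (hi - lo > 2) {
        int m1 = lo + (hi - lo) / 3;
        int m2 = hi - (hi - lo) / 3;
        if (benchmark_ms_per_token(m1) < benchmark_ms_per_token(m2)) {
            hi = m2;  // the optimum cannot lie above m2
        } else {
            lo = m1;  // the optimum cannot lie below m1
        }
    }
    // Only a handful of candidates remain: measure them directly.
    int    best    = lo;
    double best_ms = benchmark_ms_per_token(lo);
    for (int t = lo + 1; t <= hi; ++t) {
        double ms = benchmark_ms_per_token(t);
        if (ms < best_ms) { best_ms = ms; best = t; }
    }
    return best;
}

This needs roughly two benchmark runs per halving step, so the total number of runs grows logarithmically in the core count rather than linearly.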

@ggerganov (Owner):

I'll suggest that after all the checks, we clamp the default number of threads to a maximum of 8, because there is almost never a reason to go beyond that, I think.

@jon-chuang (Contributor, Author) commented Apr 13, 2023

we clamp the default number of threads to maximum of 8

Not for @KASR's case though: #603 (comment)

I think we should let the benchmark script speak for itself.


If we cannot get the physical cores, we will use max(1, min(8, hardware_concurrency)) though
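
As a rough sketch of that fallback (illustrative only, not necessarily the PR's exact code):

#include <algorithm>
#include <cstdint>
#include <thread>

// Sketch of the fallback when physical cores cannot be determined:
// clamp hardware_concurrency() (which may return 0 if unknown) to [1, 8].
static int32_t fallback_n_threads() {
    int32_t n = (int32_t) std::thread::hardware_concurrency();
    return std::max((int32_t) 1, std::min((int32_t) 8, n));
}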

@jon-chuang (Contributor, Author):

@MillionthOdin16 would you be able to check if this works on Windows for you? (Does it show num_physical_cores as the default when running llama.cpp?)

@jon-chuang changed the title from "fix(UX): Use num physical cores by default, clip default to 4, and warn about clipped default" to "fix(perf/UX): Use num physical cores by default, clip default to 4, and warn about clipped default" on Apr 15, 2023
@prusnak (Collaborator) commented Apr 15, 2023

In order not to result in poor default performance for corner cases like E/P cores, I decided to clip the default to a maximum of 4, and to warn the user with a red WARNING if the default of physical cores has been clipped.

The output belongs to stderr, not stdout (use cerr). Also the red color is not needed imho and can cause some trouble on non-standard terminals (windows).

@ivanstepanovftw (Collaborator):

Why clip default to 4?

Comment on lines 38 to 41
if (line.find("cpu cores") != std::string::npos) {
    line.erase(0, line.find(": ") + 2);
    try {
        return (int32_t) std::stoul(line);
Review comment (Collaborator):

Why is .erase applied to the whole string? Why is the result of find not checked?

@jon-chuang (Contributor, Author):

I'm not sure what you mean. What do you suggest?
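
One possible reading of the reviewer's question, sketched below (my illustration, not code from the PR; the helper name and the -1 sentinel are hypothetical): capture the position returned by find(": ") and skip the line if the separator is missing, instead of erasing from an unchecked position.

#include <cstdint>
#include <exception>
#include <string>

// Sketch only: check the result of find(": ") before using its position.
static int32_t parse_cpu_cores_line(const std::string & line) {
    if (line.find("cpu cores") != std::string::npos) {
        std::string::size_type pos = line.find(": ");
        if (pos != std::string::npos) {
            try {
                return (int32_t) std::stoul(line.substr(pos + 2));
            } catch (const std::exception &) {
                // malformed value: fall through to the sentinel below
            }
        }
    }
    return -1;  // hypothetical "not recognized" sentinel
}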

@jon-chuang (Contributor, Author):

Why clip default to 4?

Refer to discussion above.

#elif defined(_WIN32)
    SYSTEM_INFO sysinfo;
    GetNativeSystemInfo(&sysinfo);
    return (in32_t) sysinfo.dwNumberOfProcessors;
Review comment (Contributor):

With this typo and the problems with the externs above (see CI checks), this won't build on Windows, so I assume it hasn't been tested at all on Windows?

@FNsi (Contributor) commented Apr 17, 2023

Maybe it's better to use something that checks the CPU type first.

@DannyDaemonic (Contributor) commented Apr 18, 2023

I just wanted to jump in to say that on an AMD EPYC 7551P 32-Core Processor with 64 threads, I still get the best performance running with 32 threads. I don't mind the default being capped at 4 or 8 threads since I'm already in the habit of using -t 32 when I run it, but I do think for the vast majority of systems, you will get the best performance with one thread per physical core.

If the default number of threads is capped at 4 or 8, it may be helpful to have an option that sets the thread count equal to the number of physical cores. This would simplify online advice about optimizing performance by removing the need to explain the difference between physical cores and logical threads/processors, and it would give the less technically savvy a concrete option to try rather than a blind guess at the ideal number of threads.

Perhaps when the user gives 0 or -1, i.e. -t -1.
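
A sketch of that convention (illustrative only; get_num_physical_cores() is a hypothetical helper, not the PR's API):

// Sketch only: treat a non-positive -t value as "one thread per physical core".
if (params.n_threads <= 0) {
    params.n_threads = get_num_physical_cores();  // hypothetical helper
}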

@prusnak (Collaborator) commented Apr 18, 2023

Thinking about this more ...

Since we can detect the number of physical cores reasonably well on Linux and macOS, I don't think we should clamp the default to 4.

For Windows, we can reliably detect the number of physical cores with GetLogicalProcessorInformation. Documentation is here: https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformation

The code produced by GPT (totally untested):

// needs: #include <windows.h>, <iostream>, <cstdlib>
DWORD buffer_size = 0;
DWORD result = GetLogicalProcessorInformation(NULL, &buffer_size);
// assert result == FALSE && GetLastError() == ERROR_INSUFFICIENT_BUFFER
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION) malloc(buffer_size);
result = GetLogicalProcessorInformation(buffer, &buffer_size);
if (result != FALSE) {
    int num_physical_cores = 0;
    DWORD_PTR byte_offset = 0;
    // iterate with a separate cursor so the original pointer can still be freed
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION info = buffer;
    while (byte_offset < buffer_size) {
        if (info->Relationship == RelationProcessorCore) {
            num_physical_cores++;
        }
        byte_offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
        info++;
    }
    std::cout << "Number of physical cores: " << num_physical_cores << std::endl;
} else {
    std::cerr << "Error getting logical processor information: " << GetLastError() << std::endl;
}
free(buffer);

@jon-chuang (Contributor, Author) commented Apr 26, 2023

Hmm, I'm still worried about the E/P core edge case, but perhaps a warning for this will suffice.

As for the Windows one, I tried to install a virtual machine, but I'm not familiar or interested enough in setting up my dev environment on Windows to continue investigating; thus I will use a naive default of 4 for now and issue a warning about the lack of calibration on Windows, and someone who is interested and has access to a Windows machine can implement the GetLogicalProcessorInformation method.
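
Pulling the decisions in this thread together, a rough sketch of the default-selection logic being described (my summary, not the merged implementation; the helper functions are placeholders):

#include <algorithm>
#include <cstdint>
#include <thread>

int32_t count_physical_cores_linux();  // placeholder: parse "cpu cores" from /proc/cpuinfo
int32_t count_physical_cores_macos();  // placeholder: hw.perflevel0.physicalcpu, else hw.physicalcpu

static int32_t default_n_threads() {
#if defined(_WIN32)
    return 4;  // naive default for now; a warning about the lack of calibration would go here
#else
#if defined(__linux__)
    int32_t n = count_physical_cores_linux();
#elif defined(__APPLE__) && defined(__MACH__)
    int32_t n = count_physical_cores_macos();
#else
    int32_t n = 0;
#endif
    if (n > 0) {
        return n;  // one thread per physical core; the E/P-core caveat is a separate warning
    }
    // Fallback if detection fails: logical cores clamped to [1, 8].
    int32_t hc = (int32_t) std::thread::hardware_concurrency();
    return std::max((int32_t) 1, std::min((int32_t) 8, hc));
#endif
}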

@jon-chuang changed the title from "fix(perf/UX): Use num physical cores by default, clip default to 4, and warn about clipped default" to "fix(perf/UX): Use num physical cores by default, warn about E/P cores" on Apr 26, 2023
@jon-chuang (Contributor, Author):

Thanks @DannyDaemonic, the PR is looking much better now. I hope someone can take a look and give the final sign-off if it is ready.

@DannyDaemonic (Contributor):

I hope it gets in, the threading issue has really been hanging in limbo for a while now.

I'm curious if they'll let this in without a full Windows implementation. If not, I can provide one in the next day or two.

@ggerganov merged commit a5d30b1 into ggerganov:master on Apr 30, 2023
Successfully merging this pull request may close these issues.

perf: Investigate performance discrepancy with llama-rs - 1.5x-2x slower