
fix(perf/UX): Use num physical cores by default, warn about E/P cores #934

Merged: 19 commits, Apr 30, 2023

Conversation

@jon-chuang (Contributor) commented Apr 13, 2023

Fixes: #932

Hyperthreading is bad, probably because we are compute bound (not memory bound).

See also: #34

Note: I consulted GPT-4 in the making of this PR.

@sw (Contributor) commented Apr 13, 2023

I originally wrote the code parsing /proc/cpuinfo without having access to a wide variety of machines. It's good that you make the effort to improve this. How do the various methods compare to simply using std::thread::hardware_concurrency?

As for code style, this should probably be moved out of gpt_params_parse, as it's not really about parsing the commandline. There's some more logic in the header file for n_threads, it would be nice to have this in one place.

@jon-chuang changed the title from "fix(params): Use hardware cores by default" to "fix(params): Use num hardware cores by default" on Apr 13, 2023
@jon-chuang (Contributor, Author) commented Apr 13, 2023

without having access to a wide variety of machines.

This hasn't been tested on Darwin or Windows. I would appreciate CI, or someone being able to test it.

How do the various methods compare to simply using std::thread::hardware_concurrency?

For Linux, I get the number of physical cores, rather than the logical cores provided by std::thread::hardware_concurrency. Perf (ms per token) is 1.5-2x better, which makes for a much better default.

So previously, it used 16/16 hyper-threaded threads. Now it uses 8/16.

There's some more logic in the header file for n_threads, it would be nice to have this in one place.

I guess you mean here. I saw that too. Should we change all the logic to be in the header?

int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());

@sw (Contributor) commented Apr 13, 2023

Should we change all the logic to be in the header?

This is better kept in common.cpp. Maybe initialize the field to 0 or -1. Then move your code for determining the default into its own function and call that from gpt_params_parse and gpt_print_usage? But that will break if a program uses common.h but doesn't call these functions.

@jon-chuang (Contributor, Author):

This is better kept in common.cpp.

These pertain to the struct default, so I suggest adding a function declared in the header, with the implementation containing this logic in the .cpp.

Function name is get_default_physical_cpu_cores()

@jon-chuang changed the title from "fix(params): Use num hardware cores by default" to "fix(params): Use num physical cores by default" on Apr 13, 2023
@prusnak (Collaborator) commented Apr 13, 2023

Btw, sysctl hw.physicalcpu returns 8 on M1, but you want to use only 4 threads, because M1 contains 4 high-performance cores and 4 low-performance cores.

@KASR (Contributor) commented Apr 13, 2023

Just as an FYI: I did some benchmark tests for another issue (see #603 (comment)).

This was done on a Xeon W-2295 with 18 physical cores. In those benchmarks, performance was best either a bit below or a bit above the number of physical cores. Performance also increased in some cases when using more threads than the number of physical cores, so the best setting will probably be system dependent.

Perhaps it's an idea to include a benchmark script so that users can test on their own system and determine the performance as a function of the settings?

@jon-chuang (Contributor, Author) commented Apr 13, 2023

the performance was best either a bit below the number of physical cores or a bit above it.

The best performance is not the aim.

It's to provide a reasonable default. 2X slower than optimal is not reasonable (as was the result of running on all logical cores with hyperthreading on), but 10% slower is still reasonable. My guess is that num physical cores gets within 10% of optimal.

A bench script to loop through different n_threads and report results would definitely be a nice orthogonal improvement, and we could include it in the main README so some users actually get to using it.

@jon-chuang (Contributor, Author) commented Apr 13, 2023

Btw, sysctl hw.physicalcpu returns 8 on M1, but you want to use only 4 threads, because M1 contains 4 high-performance cores and 4 low-performance cores.

Hmm, I see. There was a separate discussion about this type of architecture. Let me replicate the conclusions here. (I don't think there is a super nice solution in this case.)

I also wonder if the same thing applies to the new Intel CPUs with E and P cores.

Are you able to report some rough results of running 8 vs. 4 cores in terms of the ms per token for inference mode?

@prusnak (Collaborator) commented Apr 13, 2023

Are you able to report some rough results of running 8 vs. 4 cores in terms of the ms per token for inference mode?

7B 4 threads => 80ms/token
7B 8 threads => 167ms/token

13B 4 threads => 167ms/token
13B 8 threads => 320ms/token

So it's roughly 2x slower when using 8 cores (lo+hi) instead of 4 cores (hi only).

@prusnak (Collaborator) commented Apr 13, 2023

Btw, this is the output of sysctl -a | grep hw.perflevel on my M1:

hw.perflevel0.physicalcpu: 4
hw.perflevel0.physicalcpu_max: 4
hw.perflevel0.logicalcpu: 4
hw.perflevel0.logicalcpu_max: 4
hw.perflevel0.l1icachesize: 196608
hw.perflevel0.l1dcachesize: 131072
hw.perflevel0.l2cachesize: 12582912
hw.perflevel0.cpusperl2: 4
hw.perflevel0.name: Performance
hw.perflevel1.physicalcpu: 4
hw.perflevel1.physicalcpu_max: 4
hw.perflevel1.logicalcpu: 4
hw.perflevel1.logicalcpu_max: 4
hw.perflevel1.l1icachesize: 131072
hw.perflevel1.l1dcachesize: 65536
hw.perflevel1.l2cachesize: 4194304
hw.perflevel1.cpusperl2: 4
hw.perflevel1.name: Efficiency

So we can detect the number of performance cores via hw.perflevel0.physicalcpu.

@KASR (Contributor) commented Apr 13, 2023

A bench script to loop through different n_threads and report results would definitely be a nice orthogonal improvement, and we could include it in the main README so some users actually get to using it.

I've uploaded the Python script that I use as a gist --> benchmark_threads_llama_cpp.py; feel free to include it in your PR if you want to.

@jon-chuang (Contributor, Author) commented Apr 13, 2023

Btw, this is the output of sysctl -a | grep hw.perflevel on my M1:

Anyone have a non-M1/M2 mac? What is the result of grepping perflevel?

@prusnak (Collaborator) commented Apr 13, 2023

Anyone have a non-M1 mac? What is the result of grepping perflevel?

I confirmed that hw.perflevel0.physicalcpu exists on Intel iMac and Intel MacBook too. So we can use that first. If the value is not available we can fall back to hw.physicalcpu.

Suggestion for the code:

    // needs: #include <sys/sysctl.h>
    int32_t num_physical_cores;
    size_t len = sizeof(num_physical_cores);
    int result = sysctlbyname("hw.perflevel0.physicalcpu", &num_physical_cores, &len, NULL, 0);
    if (result == 0) {
        params.n_threads = num_physical_cores;
    } else {
        result = sysctlbyname("hw.physicalcpu", &num_physical_cores, &len, NULL, 0);
        if (result == 0) {
            params.n_threads = num_physical_cores;
        }
    }

@jon-chuang (Contributor, Author) commented Apr 13, 2023

I've uploaded the python script that i use as a gist

I think we want to modify this by assuming a single global optimum and doing a halving search, so we only need about log(n_cpu) steps rather than n_cpu steps.

We should start with the default and then go up by a quarter step. There should also be a short warm-up step of -n 16 or something.
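
For illustration, here is a minimal C++ sketch of that search idea (my own sketch, not code from the PR or from the gist). It assumes ms-per-token is unimodal in the thread count, and benchmark_ms_per_token is a hypothetical callback that would run a short warm-up plus a fixed generation (e.g. -n 16) and return ms/token:

#include <functional>

// Sketch only: ternary search over thread counts, assuming ms-per-token is
// unimodal in n_threads. benchmark_ms_per_token is a hypothetical callback.
static int find_best_n_threads(int lo, int hi,
                               const std::function<double(int)> & benchmark_ms_per_token) {
    while (hi - lo > 2) {
        int m1 = lo + (hi - lo) / 3;
        int m2 = hi - (hi - lo) / 3;
        if (benchmark_ms_per_token(m1) < benchmark_ms_per_token(m2)) {
            hi = m2;  // the optimum cannot lie above m2
        } else {
            lo = m1;  // the optimum cannot lie below m1
        }
    }
    // Only a handful of candidates remain: measure them directly.
    int    best    = lo;
    double best_ms = benchmark_ms_per_token(lo);
    for (int t = lo + 1; t <= hi; ++t) {
        double ms = benchmark_ms_per_token(t);
        if (ms < best_ms) { best_ms = ms; best = t; }
    }
    return best;
}

This needs roughly two benchmark runs per halving step, so the total number of runs grows logarithmically in the core count rather than linearly.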

@ggerganov (Owner):

I'll suggest that after all the checks, we clamp the default number of threads to a maximum of 8, because there is almost never a reason to go beyond that, I think.

@jon-chuang (Contributor, Author) commented Apr 13, 2023

we clamp the default number of threads to maximum of 8

Not for @KASR's case though: #603 (comment)

I think we should let the benchmark script speak for itself.


If we cannot get the physical cores, we will use max(1, min(8, hardware_concurrency)) though
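
As a rough sketch of that fallback (illustrative only, not necessarily the PR's exact code):

#include <algorithm>
#include <cstdint>
#include <thread>

// Sketch of the fallback when physical cores cannot be determined:
// clamp hardware_concurrency() (which may return 0 if unknown) to [1, 8].
static int32_t fallback_n_threads() {
    int32_t n = (int32_t) std::thread::hardware_concurrency();
    return std::max((int32_t) 1, std::min((int32_t) 8, n));
}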

@jon-chuang (Contributor, Author):

@MillionthOdin16 would you be able to check if this works on Windows for you? (Does it show num_physical_cores as the default when running llama.cpp?)

@jon-chuang changed the title from "fix(UX): Use num physical cores by default, clip default to 4, and warn about clipped default" to "fix(perf/UX): Use num physical cores by default, clip default to 4, and warn about clipped default" on Apr 15, 2023
@prusnak (Collaborator) commented Apr 15, 2023

In order not to result in poor default performance for corner cases like E/P cores, I decided to clip the default to a maximum of 4, and to warn the user with a red WARNING if the default of physical cores has been clipped.

The output belongs to stderr, not stdout (use cerr). Also the red color is not needed imho and can cause some trouble on non-standard terminals (windows).

@ivanstepanovftw (Collaborator):

Why clip default to 4?

Comment on lines 38 to 41
if (line.find("cpu cores") != std::string::npos) {
    line.erase(0, line.find(": ") + 2);
    try {
        return (int32_t) std::stoul(line);
Review comment (Collaborator):

Why is .erase applied to the whole string? Why is the result of find not checked?

@jon-chuang (Contributor, Author):

I'm not sure what you mean. What do you suggest?
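
One possible reading of the reviewer's question, sketched below (my illustration, not code from the PR; the helper name and the -1 sentinel are hypothetical): capture the position returned by find(": ") and skip the line if the separator is missing, instead of erasing from an unchecked position.

#include <cstdint>
#include <exception>
#include <string>

// Sketch only: check the result of find(": ") before using its position.
static int32_t parse_cpu_cores_line(const std::string & line) {
    if (line.find("cpu cores") != std::string::npos) {
        std::string::size_type pos = line.find(": ");
        if (pos != std::string::npos) {
            try {
                return (int32_t) std::stoul(line.substr(pos + 2));
            } catch (const std::exception &) {
                // malformed value: fall through to the sentinel below
            }
        }
    }
    return -1;  // hypothetical "not recognized" sentinel
}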

@jon-chuang (Contributor, Author):

Why clip default to 4?

Refer to discussion above.

#elif defined(_WIN32)
    SYSTEM_INFO sysinfo;
    GetNativeSystemInfo(&sysinfo);
    return (in32_t) sysinfo.dwNumberOfProcessors;
Review comment (Contributor):

With this typo and the problems with the externs above (see CI checks), this won't build on Windows, so I assume it hasn't been tested at all on Windows?

@FNsi (Contributor) commented Apr 17, 2023

Maybe it's better to use something that checks the CPU type first.

@DannyDaemonic (Contributor) commented Apr 18, 2023

I just wanted to jump in to say that on an AMD EPYC 7551P 32-Core Processor with 64 threads, I still get the best performance running with 32 threads. I don't mind the default being capped at 4 or 8 threads since I'm already in the habit of using -t 32 when I run it, but I do think for the vast majority of systems, you will get the best performance with one thread per physical core.

If the default number of threads is capped at 4 or 8, it may be helpful to have an option that sets the thread count equal to the number of physical cores. This would simplify online advice about optimizing performance by removing the need to explain the difference between physical cores and logical threads/processors, and it would give the less technically savvy a concrete option to try rather than a blind guess at the ideal number of threads.

Perhaps when the user gives 0 or -1, i.e. -t -1.
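
A sketch of that convention (illustrative only; get_num_physical_cores() is a hypothetical helper, not the PR's API):

// Sketch only: treat a non-positive -t value as "one thread per physical core".
if (params.n_threads <= 0) {
    params.n_threads = get_num_physical_cores();  // hypothetical helper
}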

@prusnak (Collaborator) commented Apr 18, 2023

Thinking about this more ...

Since we can detect the number of physical cores reasonably well on Linux and macOS, I don't think we should clamp the default to 4.

For Windows, we can reliably detect the number of physical cores with GetLogicalProcessorInformation. Documentation is here: https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformation

The code produced by GPT (totally untested):

// needs: #include <windows.h>, <iostream>, <cstdlib>
DWORD buffer_size = 0;
DWORD result = GetLogicalProcessorInformation(NULL, &buffer_size);
// assert result == FALSE && GetLastError() == ERROR_INSUFFICIENT_BUFFER
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION) malloc(buffer_size);
result = GetLogicalProcessorInformation(buffer, &buffer_size);
if (result != FALSE) {
    int num_physical_cores = 0;
    DWORD_PTR byte_offset = 0;
    // iterate with a separate cursor so the original pointer can still be freed
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION info = buffer;
    while (byte_offset < buffer_size) {
        if (info->Relationship == RelationProcessorCore) {
            num_physical_cores++;
        }
        byte_offset += sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
        info++;
    }
    std::cout << "Number of physical cores: " << num_physical_cores << std::endl;
} else {
    std::cerr << "Error getting logical processor information: " << GetLastError() << std::endl;
}
free(buffer);

@jon-chuang (Contributor, Author) commented Apr 26, 2023

Hmm, I'm still worried about the E/P core edge case, but perhaps a warning for this will suffice.

As for the Windows one, I tried to install a virtual machine, but I'm not familiar or interested enough in setting up my dev environment on Windows to continue investigating; thus I will use a naive default of 4 for now and issue a warning about the lack of calibration on Windows, and someone who is interested and has access to a Windows machine can implement the GetLogicalProcessorInformation method.
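
Pulling the decisions in this thread together, a rough sketch of the default-selection logic being described (my summary, not the merged implementation; the helper functions are placeholders):

#include <algorithm>
#include <cstdint>
#include <thread>

int32_t count_physical_cores_linux();  // placeholder: parse "cpu cores" from /proc/cpuinfo
int32_t count_physical_cores_macos();  // placeholder: hw.perflevel0.physicalcpu, else hw.physicalcpu

static int32_t default_n_threads() {
#if defined(_WIN32)
    return 4;  // naive default for now; a warning about the lack of calibration would go here
#else
#if defined(__linux__)
    int32_t n = count_physical_cores_linux();
#elif defined(__APPLE__) && defined(__MACH__)
    int32_t n = count_physical_cores_macos();
#else
    int32_t n = 0;
#endif
    if (n > 0) {
        return n;  // one thread per physical core; the E/P-core caveat is a separate warning
    }
    // Fallback if detection fails: logical cores clamped to [1, 8].
    int32_t hc = (int32_t) std::thread::hardware_concurrency();
    return std::max((int32_t) 1, std::min((int32_t) 8, hc));
#endif
}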

@jon-chuang changed the title from "fix(perf/UX): Use num physical cores by default, clip default to 4, and warn about clipped default" to "fix(perf/UX): Use num physical cores by default, warn about E/P cores" on Apr 26, 2023
@jon-chuang (Contributor, Author):

Thanks @DannyDaemonic, the PR is looking much better now. I hope someone can take a look and give the final sign-off if it is ready.

@DannyDaemonic (Contributor):

I hope it gets in, the threading issue has really been hanging in limbo for a while now.

I'm curious if they'll let this in without a full Windows implementation. If not, I can provide one in the next day or two.

@ggerganov merged commit a5d30b1 into ggerganov:master on Apr 30, 2023
Successfully merging this pull request may close these issues.

perf: Investigate performance discrepancy with llama-rs - 1.5x-2x slower