Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement get_num_physical_cores() for Windows #1278

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

DannyDaemonic
Copy link
Collaborator

@DannyDaemonic DannyDaemonic commented May 2, 2023

This implements get_num_physical_cores on Windows. It uses GetLogicalProcessorInformationEx to tease out the number of physical cores and the number of those that are efficiency class 0. This should also address #572 on Windows, but on Windows only. I think something similar will have to be done for the other get_num_physical_cores paths in other OSs.

The only other thing this does it move the default param generation from gpt_params_parse to gpt_print_usage where it's used. Without this gpt_params_parse is called twice. Once for the instantiation in main and once for the instantiation in gpt_params_parse. This isn't terrible but it also means get_num_physical_cores() is called twice no matter what.

This works on all the Windows systems I tested it on, but none of mine have performance cores. According to the documentation, this should work in those cases as well but I would like to verify it with someone who has P/E cores.

This code also prints the number of cores found to stderr and that should be removed once we've verified it works with P/E core CPU systems. Preferably before we commit, but anything is fine.

I'll edit this to keep it up to date, this is the entirety of the logic with the extra debugging lines removed. No lines will need to be added or modified to remove that extra information, so it'll just be a bunch of -s.

    // Call GetLogicalProcessorInformationEx again with the allocated buffer
    if (GetLogicalProcessorInformationEx(
            RelationAll,
            reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *>(buffer),
            &length)) {
        DWORD offset = 0;

        while (offset < length) {
            auto info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *>(buffer + offset);

            if (info->Relationship == RelationProcessorCore) {
                physical_cores += info->Processor.GroupCount;

                for (WORD i = 0; i < info->Processor.GroupCount; ++i) {
                    // Assuming EfficiencyClass 0 represents performance cores, and others represent efficiency cores
                    if (info->Processor.EfficiencyClass == 0) {
                        physical_performance_cores++;
                    }
                }
            }
            offset += info->Size;
        }
    }

    if (physical_performance_cores > 0) {
        return static_cast<int32_t>(physical_performance_cores);
    } else if (physical_cores > 0) {
        return static_cast<int32_t>(physical_cores);
    }

examples/common.cpp Outdated Show resolved Hide resolved
physical_cores += info->Processor.GroupCount;

for (WORD i = 0; i < info->Processor.GroupCount; ++i) {
int core_count = static_cast<int>(__popcnt64(info->Processor.GroupMask[i].Mask));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will __popcnt64 work on other compilers such as mingw? Sadly, std::popcount is C++20.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about mingw. It sounds like it didn't work on older versions of it anyway.

__popcnt64 is used here for counting the logical cores, which I believe to be unnecessary. I can rip out all the logical core stuff if it's working on P/E systems since we only care about the physical cores count on modern CPUs. I have yet to see a single case where hyperthreading helps llama.cpp performance.

examples/common.cpp Outdated Show resolved Hide resolved
examples/common.cpp Outdated Show resolved Hide resolved
@prusnak
Copy link
Sponsor Collaborator

prusnak commented May 2, 2023

to tease out the number of physical cores and the number of those that are efficiency class 0. This should also address #572 on Windows, but on Windows only. I think something similar will have to be done for the other get_num_physical_cores paths in other OSs.

For the record: macOS code path already does that (take only performance cores into account when this info is available).

@anzz1
Copy link
Contributor

anzz1 commented May 4, 2023

As explained in this comment #842 (comment)

The proper way to detect whether you are on the Intel P/E architecture is to use the corresponding machine instructions.

The example function was posted in a comment in the same thread.

//  1 = P core
//  0 = E core
// -1 = fail / not P/E arch
inline int is_thread_on_p_core(void) {
  static unsigned const char a[] = {0x31,0xC9,0xB8,0x1A,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xE8,0x18,0x83,0xF8,0x40,0x0F,0x85,0x06,0x00,0x00,0x00,0xB8,0x01,0x00,0x00,0x00,0xC3,0x83,0xE8,0x20,0xF7,0xD8,0x19,0xC0,0xC3};
  return ((int (*)(void)) (void*)((void*)a))();
}

// 1 = hybrid x86 cpu
inline int is_x86_hybrid_cpu(void) {
  static unsigned const char a[] = {0x31,0xC9,0xB8,0x07,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xEA,0x0F,0x83,0xE2,0x01,0x89,0xD0,0xC3};
  return ((int (*)(void)) (void*)((void*)a))();
}

// 1 = intel 12th/13th gen cpu
inline int is_intel_p_e_core_arch(void) {
  return (is_x86_hybrid_cpu() && (is_thread_on_p_core() >= 0));
}

So testing for if (is_intel_p_e_core_arch()) { ... } is all you need. No need to go through a complex layer of abstractions only to finally arrive to the same exact set of instructions at the end and regurgitating the information back the chain of abstraction, provided that the iteration of the standard library/compiler/platform you are on even has the support for it and it's implemented properly.

When new processors are released and new features added along with new instructions, any compiler that came before it cannot possibly be aware of those instructions. So the cutting-edge stuff will always have spotty standard library, platform and compiler support.

The beauty of machine code is that it is completely insert-anything-here-agnostic, as the instructions are set in stone. Yes, they will be expanded upon, but the previous instructions will keep working no matter what platform or planet you are on or the iteration of a compiler you are working with. It will simply just work™ , no matter what.

By going directly to the source of the information, removing those layers of abstractions, you will also reduce your maintainability burden, as there is no longer any need for different codepaths for different platforms, finding out if the user is on a recent enough platform/kernel version, has the correct versions of the required standard libraries and also happens to have a recent compiler version to even have a support for them.

The same goes for detecting the number of logical/physical cores, which can be tested by checking the SMT/HTT bits and the core bits / logical bits. Implement it correctly, in machine code, once, to never need to maintain it again or deal with any library/platform/compiler issues.

Everything can be literally found in the good ol' RTFM.

The relevant bits of info can be found right here:

Intel ® Architecture Instruction Set Extensions and Future Features "Chapter 1.5 CPUID Instruction"

AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System Instructions "Appendix D: Instruction Subsets and CPUID Feature Flags"

Or you can probably just ask ChatGPT.

@DannyDaemonic
Copy link
Collaborator Author

@anzz1 Thanks for the information. You remind me of a friend in college who was big on bare metal programming. He was very passionate about getting every ounce out of his hardware, which was always the latest and greatest.

I did notice those functions over at 842, but it seemed to me that they only tell you what our current thread is running on. Since we want the number of non-hyperthreaded, non-efficiency cores, I think we'd also have to figure out the mask for each of the non hyperthreaded cores and loop over it, setting our thread affinity to each (OS call?) and checking that we've switched over before calling is_thread_on_p_core. I'm probably overlooking something and there may be a simpler approach, but unfortunately, working with raw assembly is outside of my comfort zone.

As for this patch, I was just trying to help by filling in a //TODO: Implement that already exists in the codebase for a _WIN32 target. I don't even normally use llama.cpp in Windows, but I've got a lot of experience working with these (admittedly terrible) Windows API calls, so that's the code I submitted. I don't think this needs to be a permanent solution and I wouldn't shed a single tear if my code were replaced with a nice platform independent solution, but unfortunately, this is all I can personally offer.

@slaren
Copy link
Collaborator

slaren commented May 4, 2023

For clarity of the code, I would very much prefer using the Windows API than executing a magic hex string.

@prusnak
Copy link
Sponsor Collaborator

prusnak commented May 4, 2023

For clarity of the code, I would very much prefer using the Windows API than executing a magic hex string.

This. Plus is_thread_on_p_core solves very different problem to what is being solved by this PR.

@anzz1
Copy link
Contributor

anzz1 commented May 5, 2023

I did notice those functions over at 842, but it seemed to me that they only tell you what our current thread is running on. Since we want the number of non-hyperthreaded, non-efficiency cores, I think we'd also have to figure out the mask for each of the non hyperthreaded cores and loop over it, setting our thread affinity to each (OS call?) and checking that we've switched over before calling is_thread_on_p_core. I'm probably overlooking something and there may be a simpler approach, but unfortunately, working with raw assembly is outside of my comfort zone.

You are correct. This is the stonewall I hit too while researching this, it's been a while now but I remember that a note buried somewhere in the docs ended with that you can only determine the P/E status of the currently running thread, with no further clarification. It doesn't help that a lot of the documentation is "old" regarding topology identification, like this one Intel® 64 Architecture Processor Topology Enumeration from January 2018.

It certainly seems a dumb approach having to first figure out the amount of cores and their mask for locking the affinity and starting a thread for each one where you run the cpuid instruction to determine whether it's a P/E core and then count the total.

But I could be wrong, as I couldn't find further information and there very well might be a way to determine them without using any such tricks. I stopped my research there as going by the documentation alone and not having the ability to experiment or validate anything without having the processor wasn't going to cut it.

Definitely someone with the processor should test this code if it works, but I fear that the EfficiencyClass is not going to report the correct value and the P/E thread test mess is going to be needed.

For clarity of the code, I would very much prefer using the Windows API than executing a magic hex string.

Absolutely agree with you here, having to use a magic hex string is tremendously stupid. For that we can blame MSVC, as it doesn't support x86_64 assembly. Any half decent compiler like GCC obviously supports it. MSVC used to too, since VC6 back from 1998, on x86 it did and does support inline asm and there is even __declspec(naked) for writing straight assembly without any compiler decorations to mess with your stuff. For some inexplicable reason none of this was implemented for x86_64 (but was implemented for ARM).

So the "magic hex string" is only there for MSVC. For every other compiler you could just write it in inline asm. As posted in the comment, such a magic string should obviously have the companying assembly written in a comment so it's not a magic string anymore.

Another option is to use compiler-specific intrinsics like MSVC's __cpuid / __cpuidex or GCC's __get_cpuid. Those will translate to the same machine code but it's certainly easier to read and understand. Then again you'd have to write a case for each compiler separately.

So yeah, a magic hex string is definitely stupid, obtuse and unreadable. But from the heap of garbage options, it seems to be the least moldy one.

This. Plus is_thread_on_p_core solves very different problem to what is being solved by this PR.

Yeah this isn't strictly the same thing, the only reason I interjected was the EfficiencyClass part which is related to it. As it looks like the P/E core part is going to need to be implemented with cpuid anyway, a platform-independent function for a num of physical/logical cores is an easy byproduct of it.

Or maybe EfficiencyClass does actually work for determining P/E cores and this is solution works as-is, not platform-independent but certainly easier to read and understand than a magic string. One good thing WinAPI has going for it that if it works once it probably won't need any maintenance, as Microsoft is pretty great at supporting the api funcs for a long time.

EDIT: On a second thought, someone probably already has solved this issue and there already exists x86_64 assembly code for both the physical/logical core count and P/E core count functions as a .S file which could be added to the makefiles, that would also be supported in MSVC via masm and GCC/Clang/others natively.

@sw sw linked an issue May 12, 2023 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fix (perf/UX): get physical cores for Windows
4 participants