Help test CPUSet patch for Windows and Linux #6927

mann1x opened this issue Apr 26, 2024 · 15 comments
Labels: enhancement (New feature or request), stale

mann1x commented Apr 26, 2024

Feature Description

Adding CPUSet and thus better core selection and usage for llama.cpp.
Works on Windows and Linux x64 with up to 64 logical cores.

Motivation

About 10% faster and more efficient inference.
Keeps the system responsive while llama.cpp is running.

Possible Implementation

#6832

Problems addressed:

  • Only uses physical cores
  • Filters out the E-Cores on Intel platforms
  • Sticks to the same Last Level cache (e.g. L3 for AMD dual-CCD processors)
  • Cores are selected based on their scheduler priority (default: worst to best cores)
  • Compute threads are only allocated on the selected cores
  • Disables Windows power management throttling (Power, Timer, Memory)
  • Always excludes Core 0
  • Custom CPU affinity bitmask
  • Optionally includes Core 0 or the threaded (SMT) cores

These command line options have been added:

  • -bco: Best Core Order, set to 1 inverts the default order so the cores are selected from best to worst
  • -llct: Last Level Cache Traversal, set to 1 allows the core selection to traverse the Last Level cache index
  • -acz: Allow Core Zero, set to 1 allows selection of Core 0
  • -atc: Allow Threaded Cores, set to 1 allows selection of threaded, non-physical cores
  • -ccm: Custom CPU Mask, sets a custom CPU affinity bitmask as an integer
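
For example (the model path and prompt here are just placeholders), to select cores from best to worst and allow Core 0 and the SMT siblings:

./main -m model.gguf -p "test" -bco 1 -acz 1 -atc 1

or to force a custom mask, e.g. cores 1-3 only (binary 1110 = 14):

./main -m model.gguf -p "test" -ccm 14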

Please test whether the default settings are applied and whether the options behave as expected.
In particular, test whether the E-Cores on Intel are correctly detected and disabled.

Compare the prompt and eval speed against the master branch and report the results.

Thanks

mann1x added the enhancement (New feature or request) label on Apr 26, 2024
Jeximo (Contributor) commented Apr 26, 2024

It appears something is off between my device and master. Similar to another PR, this one increases speed significantly.

uname -a
Linux localhost 4.14.190-23725627-abG975WVLS8IWD1 #2 SMP PREEMPT Mon Apr 10 18:16:39 KST 2023 aarch64 Android

Cmake instruction:

cmake -B build -DLLAMA_SANITIZE_ADDRESS=ON -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod+i8mm && cd build && cmake --build . --config Release --target main

I left new command line arguments as default. Here's my arguments: ./main -m ~/WaveCoder-Ultra-6.7b.IQ4_XS.gguf --color -ins --interactive-first --penalize-nl --in-suffix "### Response:" --temp 0 -c 2048 -t 4 -b 10 -p "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request."

master:

llama_print_timings:        load time =    9842.17 ms
llama_print_timings:      sample time =      52.57 ms /   274 runs   (    0.19 ms per token,  5212.49 tokens per second)
llama_print_timings: prompt eval time =   48658.72 ms /    62 tokens (  784.82 ms per token,     1.27 tokens per second)
llama_print_timings:        eval time =  105949.40 ms /   273 runs   (  388.09 ms per token,     2.58 tokens per second)
llama_print_timings:       total time =  169384.53 ms /   335 tokens

pr:

llama_print_timings:        load time =    2058.99 ms
llama_print_timings:      sample time =      56.71 ms /   274 runs   (    0.21 ms per token,  4831.26 tokens per second)
llama_print_timings: prompt eval time =   19187.89 ms /    61 tokens (  314.56 ms per token,     3.18 tokens per second)
llama_print_timings:        eval time =  102734.94 ms /   273 runs   (  376.32 ms per token,     2.66 tokens per second)
llama_print_timings:       total time =  130020.72 ms /   334 tokens

load time is nice, and pr is faster overall.

mann1x (Author) commented Apr 26, 2024

@Jeximo
Thanks for testing. Which CPU do you have?
Were you able to check whether the correct cores are being used (maybe by keeping an eye on htop in another shell)?
It would also be nice to test the command line options, if possible.

Jeximo (Contributor) commented Apr 26, 2024

Which CPU do you have?

It's Qualcomm SDM855 Snapdragon 855 - Octa-core (1x2.84 GHz Kryo 485 & 3x2.41 GHz Kryo 485 & 4x1.78 GHz Kryo 485)

maybe keeping an eye with htop on another shell

I tried htop, except it can't access my cores - all read 0%.

Would be nice to test also the command line options

Here's a few runs:

-atc 1

llama_print_timings:        load time =    2881.04 ms
llama_print_timings:      sample time =      56.06 ms /   307 runs   (    0.18 ms per token,  5476.08 tokens per second)
llama_print_timings: prompt eval time =   37623.96 ms /    62 tokens (  606.84 ms per token,     1.65 tokens per second)
llama_print_timings:        eval time =  113151.68 ms /   306 runs   (  369.78 ms per token,     2.70 tokens per second)
llama_print_timings:       total time =  160326.43 ms /   368 tokens
-llct 1

llama_print_timings:        load time =    2140.17 ms
llama_print_timings:      sample time =      56.09 ms /   307 runs   (    0.18 ms per token,  5472.86 tokens per second)
llama_print_timings: prompt eval time =   57694.27 ms /    62 tokens (  930.55 ms per token,     1.07 tokens per second)
llama_print_timings:        eval time =  111914.57 ms /   306 runs   (  365.73 ms per token,     2.73 tokens per second)
llama_print_timings:       total time =  181670.32 ms /   368 tokens
-acz 1

llama_print_timings:        load time =    1718.84 ms
llama_print_timings:      sample time =      55.58 ms /   307 runs   (    0.18 ms per token,  5523.97 tokens per second)
llama_print_timings: prompt eval time =   18806.12 ms /    62 tokens (  303.32 ms per token,     3.30 tokens per second)
llama_print_timings:        eval time =  110356.45 ms /   306 runs   (  360.64 ms per token,     2.77 tokens per second)
llama_print_timings:       total time =  139162.22 ms /   368 tokens
default options again

llama_print_timings:        load time =    1667.63 ms
llama_print_timings:      sample time =      57.46 ms /   307 runs   (    0.19 ms per token,  5342.57 tokens per second)
llama_print_timings: prompt eval time =   23961.02 ms /    62 tokens (  386.47 ms per token,     2.59 tokens per second)
llama_print_timings:        eval time =  117470.18 ms /   306 runs   (  383.89 ms per token,     2.60 tokens per second)
llama_print_timings:       total time =  147956.72 ms /   368 tokens

It's Android, so core 0 is the biggest. -acz seems to show that.

bmtwl (Contributor) commented Apr 28, 2024

Hello @mann1x
This is a very interesting patch that is quite a bit more sophisticated than the current "ith" approach to thread affinity, and it has a few other options that would be difficult to achieve with just the --numa numactl flag (in conjunction with the numactl utility, which only works on Linux anyway).
I also happen to be working on some CPU core level pinning code as part of a memory locality patch for improving numa performance. I haven't tried to do e-core/p-core selection or automatic thread count, but am using a hash to put threads into the same numa node as the tensors that they are going to be running inference on. The hash is based on the order that the tensor was loaded and attempts to keep thread/core/tensor affinity consistent throughout inference.
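
Roughly, the idea is something along these lines (simplified sketch, not the actual code in my patch):

// Simplified sketch: a tensor keeps the order in which it was loaded, and that
// order maps to a NUMA node; the thread that runs inference on the tensor is
// then pinned to a core on the same node, so the mapping stays stable.
static int numa_node_for_tensor(int tensor_load_order, int n_numa_nodes) {
    return tensor_load_order % n_numa_nodes;
}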
I think the two patches could probably coexist and provide mutual benefits, but maybe the current location in pthread_create() isn't ideal? In the master branch thread affinity is being set in ggml_graph_compute_thread() with set_numa_thread_affinity(). I see that this is commented out in your branch, effectively disabling numa controls globally. (although I assume that's just for testing)
Maybe you could make use of the existing code pathways in set_numa_thread_affinity() in the if (!ggml_is_numa()) code block that executes if the numa flag isn't set? The function naming would become confusing in that case, but it feels to me like the natural place to attempt this work.
PS: If you have a higher-end modern AMD consumer CPU, there's a good chance you can test running in explicit NUMA mode by setting the "NPS" option in your BIOS.

mann1x (Author) commented Apr 28, 2024

Hello @mann1x This is a very interesting patch that is quite a bit more sophisticated than the current "ith" approach to thread affinity, and it has a few other options that would be difficult to achieve with just the --numa numactl flag (in conjunction with the numactl utility, which only works on Linux anyway).
PS: If you have a higher-end modern AMD consumer CPU, there's a good chance you can test running in explicit NUMA mode by setting the "NPS" option in your BIOS.

Yes, support on Windows should be quite easy as the information is available via CPUSet.
I have a 5950X indeed; it's set to NPS0 right now.
But I will switch to NUMA-aware mode and develop it for Windows.
I will have to rely on someone else to test on Linux.
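
Roughly what I mean by the information being available via CPUSet (sketch only, Windows 10+, not the exact code in the PR):

#include <windows.h>
#include <stdlib.h>

// Sketch: enumerate the system CPU sets; each entry describes one logical
// processor, including its NUMA node, LLC index and efficiency class.
static void enumerate_cpu_sets(void) {
    ULONG len = 0;
    GetSystemCpuSetInformation(NULL, 0, &len, GetCurrentProcess(), 0);
    SYSTEM_CPU_SET_INFORMATION * info = (SYSTEM_CPU_SET_INFORMATION *) malloc(len);
    if (info == NULL) {
        return;
    }
    if (GetSystemCpuSetInformation(info, len, &len, GetCurrentProcess(), 0)) {
        for (ULONG off = 0; off < len; ) {
            SYSTEM_CPU_SET_INFORMATION * e = (SYSTEM_CPU_SET_INFORMATION *) ((BYTE *) info + off);
            // e->CpuSet.NumaNodeIndex, e->CpuSet.LastLevelCacheIndex and
            // e->CpuSet.EfficiencyClass are what the NUMA/LLC/E-Core filtering needs
            off += e->Size;
        }
    }
    free(info);
}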

I also happen to be working on some CPU core level pinning code as part of a memory locality patch for improving numa performance. I haven't tried to do e-core/p-core selection or automatic thread count, but am using a hash to put threads into the same numa node as the tensors that they are going to be running inference on. The hash is based on the order that the tensor was loaded and attempts to keep thread/core/tensor affinity consistent throughout inference. I think the two patches could probably coexist and provide mutual benefits, but maybe the current location in pthread_create() isn't ideal? In the master branch thread affinity is being set in ggml_graph_compute_thread() with set_numa_thread_affinity(). I see that this is commented out in your branch, effectively disabling numa controls globally. (although I assume that's just for testing) Maybe you could make use of the existing code pathways in set_numa_thread_affinity() in the if (!ggml_is_numa()) code block that executes if the numa flag isn't set? The function naming would become confusing in that case, but it feels to me like the natural place to attempt this work.

It would be really nice to work together on this. A hash to pin the tensor affinity as well is exactly what I was thinking about.
But to be honest, I still have to understand 99% of the compute pipeline :)
How can we work together to merge the PRs? I have no idea...

I used pthread_create() because it was just there.
I started this considering only Windows, so it came naturally.

The issue with the existing pathway, if I understand it correctly, is that the affinity is set on the calling thread and, once the threads are spawned, it is reset with clear_numa_thread_affinity().
This won't work on Windows and it doesn't support per-worker-thread affinity.
For me, the main and only pathway should be in this loop:

// create thread pool
if (n_threads > 1) {
    for (int j = 1; j < n_threads; ++j) {

Here each thread should immediately be set to the right affinity, one core each.
Doing it later will harm performance, especially on Windows but in general on all platforms, because a swarm of threads migrating from one core to another is itself a noticeable workload that has an impact during computation, especially with many CPUs.
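
Something like this, as a simplified sketch of the Linux side (core_for_worker() here is just a placeholder for the selection logic, not the actual code in the PR):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Placeholder for the core selection logic (physical cores only, E-Cores
// filtered out, LLC-aware ordering, etc.): maps worker j to a logical CPU.
extern int core_for_worker(int j);

// Pin the worker right after pthread_create(), before it starts computing,
// so it never migrates between cores during inference.
static void pin_worker(pthread_t t, int j) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_for_worker(j), &set);
    pthread_setaffinity_np(t, sizeof(cpu_set_t), &set);
}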

What do you think?

I see that this is commented out in your branch, effectively disabling numa controls globally. (although I assume that's just for testing)

Ehm no, probably a mistake... where do I have to look?
I tried to use it, so maybe when I removed the code I inadvertently removed something else as well.

arch-btw (Contributor) commented

This is interesting. Here are my results:

PR

llama_print_timings:        load time =    1612.56 ms
llama_print_timings:      sample time =       3.21 ms /    32 runs   (    0.10 ms per token,  9953.34 tokens per second)
llama_print_timings: prompt eval time =    3677.81 ms /    10 tokens (  367.78 ms per token,     2.72 tokens per second)
llama_print_timings:        eval time =    9243.79 ms /    31 runs   (  298.19 ms per token,     3.35 tokens per second)
llama_print_timings:       total time =   13915.33 ms /    41 tokens

master + OpenBLAS

llama_print_timings:        load time =    1603.83 ms
llama_print_timings:      sample time =       3.65 ms /    27 runs   (    0.14 ms per token,  7399.29 tokens per second)
llama_print_timings: prompt eval time =    3278.74 ms /    10 tokens (  327.87 ms per token,     3.05 tokens per second)
llama_print_timings:        eval time =    7489.52 ms /    26 runs   (  288.06 ms per token,     3.47 tokens per second)
llama_print_timings:       total time =   11436.00 ms /    36 tokens

Master is using OpenBLAS though; should I disable it?

n_threads = 3 / 8

The PR is selecting only 3 of my 8 logical cores, but I have 4 physical cores.

mann1x (Author) commented Apr 28, 2024

@arch-btw
With only 4 cores, you may need to enable Core 0 as well to get the same performance; try with -acz 1.
Yes, you may have to disable OpenBLAS to compare 1:1, I'm not sure.

arch-btw (Contributor) commented Apr 28, 2024

@mann1x ok that helped!

PR + -acz 1

llama_print_timings:        load time =    1597.64 ms
llama_print_timings:      sample time =       3.13 ms /    30 runs   (    0.10 ms per token,  9575.49 tokens per second)
llama_print_timings: prompt eval time =    2961.59 ms /    10 tokens (  296.16 ms per token,     3.38 tokens per second)
llama_print_timings:        eval time =    7247.62 ms /    29 runs   (  249.92 ms per token,     4.00 tokens per second)
llama_print_timings:       total time =   10840.06 ms /    39 tokens

mann1x (Author) commented Apr 28, 2024

@arch-btw
If you can, please try again with the latest commit; I added automatic inclusion of Core 0 below 6 threads.
I wonder if it works.

arch-btw (Contributor) commented

@mann1x

Great job, it selects it automatically now:

Start: get_num_physical_cores
Check for Logical CPU: 0
CPU 0 is physical, siblings: 0003
Check for Logical CPU: 1
Check for Logical CPU: 2
CPU 2 is physical, siblings: 000c
Check for Logical CPU: 3
Check for Logical CPU: 4
CPU 4 is physical, siblings: 0030
Check for Logical CPU: 5
Check for Logical CPU: 6
CPU 6 is physical, siblings: 00c0
Check for Logical CPU: 7
Check for Logical CPU: 8
get_num_physical_cores Physical CPU count: 4
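
For reference, those masks look like the per-CPU thread_siblings masks from sysfs; a minimal sketch of that kind of check (the path and parsing are my assumption, not necessarily what the patch does, and it only handles up to 32 logical CPUs):

#include <stdio.h>

// A CPU counts as physical if it is the lowest bit set in its own siblings
// mask, e.g. cpu0 with mask 0003 -> siblings {0,1}, cpu0 is the physical one.
static int is_first_sibling(int cpu) {
    char path[128];
    unsigned int mask = 0;
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings", cpu);
    FILE * f = fopen(path, "r");
    if (f == NULL) {
        return 0;
    }
    int ok = fscanf(f, "%x", &mask) == 1;
    fclose(f);
    return ok && mask != 0 && (mask & (~mask + 1u)) == (1u << cpu);
}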

system_info: n_threads = 4 / 8

llama_print_timings:        load time =    1535.93 ms
llama_print_timings:      sample time =       3.35 ms /    26 runs   (    0.13 ms per token,  7772.80 tokens per second)
llama_print_timings: prompt eval time =    3131.33 ms /    10 tokens (  313.13 ms per token,     3.19 tokens per second)
llama_print_timings:        eval time =    6485.68 ms /    25 runs   (  259.43 ms per token,     3.85 tokens per second)
llama_print_timings:       total time =   10486.39 ms /    35 tokens

mann1x (Author) commented Apr 29, 2024

If there's someone using an Intel CPU with E-Cores, please test that the E-Cores are being excluded from the selection under Windows and Linux.

ggerganov (Owner) commented Apr 29, 2024

Btw, don't test performance with sanitizers enabled: -DLLAMA_SANITIZE_ADDRESS=ON
Only use Release builds without sanitizers.

Jeximo (Contributor) commented Apr 29, 2024

Btw, don't test performance with sanitizers enabled

Thanks for clarifying.

On the other hand, #6832 is no longer building for me:

build error log

(master)> gh pr checkout #6832
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 55 (delta 34), reused 34 (delta 24), pack-reused 8
Unpacking objects: 100% (55/55), 166.26 KiB | 803.00 KiB/s, done.
 * [new ref]             refs/pull/6832/head -> mannix-win32-cpuset
Switched to branch 'mannix-win32-cpuset'

~/llama2 (mannix-win32-cpuset) cmake -B build -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod+i8mm && cd build && cmake --build . --config Release --target server --target main

-- The C compiler identification is Clang 18.1.4
-- The CXX compiler identification is Clang 18.1.4
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /data/data/com.termux/files/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /data/data/com.termux/files/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /data/data/com.termux/files/usr/bin/git (found version "2.44.0")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- ccache found, compilation results will be cached. Disable with LLAMA_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- ARM detected
-- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E
-- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
-- Configuring done (3.6s)
-- Generating done (0.3s)
-- Build files have been written to: /data/data/com.termux/files/home/llama2/build
[ 9%] Generating build details from Git
-- Found Git: /data/data/com.termux/files/usr/bin/git (found version "2.44.0")
[ 9%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[ 9%] Built target build_info
[ 9%] Building C object CMakeFiles/ggml.dir/ggml.c.o
/data/data/com.termux/files/home/llama2/ggml.c:1635:5: warning: implicit conversion increases floating-point precision: 'float32_t' (aka 'float') to 'ggml_float' (aka 'double') [-Wdouble-promotion]
1635 | GGML_F16_VEC_REDUCE(sumf, sum);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/data/data/com.termux/files/home/llama2/ggml.c:1055:41: note: expanded from macro 'GGML_F16_VEC_REDUCE'
1055 | #define GGML_F16_VEC_REDUCE GGML_F32Cx4_REDUCE
| ^
/data/data/com.termux/files/home/llama2/ggml.c:1045:38: note: expanded from macro 'GGML_F32Cx4_REDUCE'
1045 | #define GGML_F32Cx4_REDUCE GGML_F32x4_REDUCE
| ^
/data/data/com.termux/files/home/llama2/ggml.c:975:11: note: expanded from macro 'GGML_F32x4_REDUCE'
975 | res = GGML_F32x4_REDUCE_ONE(x[0]);
| ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/data/data/com.termux/files/home/llama2/ggml.c:960:34: note: expanded from macro 'GGML_F32x4_REDUCE_ONE'
960 | #define GGML_F32x4_REDUCE_ONE(x) vaddvq_f32(x)
| ^~~~~~~~~~~~~
/data/data/com.termux/files/home/llama2/ggml.c:1683:9: warning: implicit conversion increases floating-point precision: 'float32_t' (aka 'float') to 'ggml_float' (aka 'double') [-Wdouble-promotion]
1683 | GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/data/data/com.termux/files/home/llama2/ggml.c:1055:41: note: expanded from macro 'GGML_F16_VEC_REDUCE'
1055 | #define GGML_F16_VEC_REDUCE GGML_F32Cx4_REDUCE
| ^
/data/data/com.termux/files/home/llama2/ggml.c:1045:38: note: expanded from macro 'GGML_F32Cx4_REDUCE'
1045 | #define GGML_F32Cx4_REDUCE GGML_F32x4_REDUCE
| ^
/data/data/com.termux/files/home/llama2/ggml.c:975:11: note: expanded from macro 'GGML_F32x4_REDUCE'
975 | res = GGML_F32x4_REDUCE_ONE(x[0]);
| ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/data/data/com.termux/files/home/llama2/ggml.c:960:34: note: expanded from macro 'GGML_F32x4_REDUCE_ONE'
960 | #define GGML_F32x4_REDUCE_ONE(x) vaddvq_f32(x)
| ^~~~~~~~~~~~~
2 warnings generated.
[ 18%] Building C object CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 27%] Building C object CMakeFiles/ggml.dir/ggml-backend.c.o
[ 27%] Building C object CMakeFiles/ggml.dir/ggml-quants.c.o
/data/data/com.termux/files/home/llama2/ggml-quants.c:3705:46: warning: arithmetic on a pointer to void is a GNU extension [-Wgnu-pointer-arith]
3705 | const block_q4_0 * restrict vx1 = vx + bx;
| ~~ ^
/data/data/com.termux/files/home/llama2/ggml-quants.c:3708:46: warning: arithmetic on a pointer to void is a GNU extension [-Wgnu-pointer-arith]
3708 | const block_q8_0 * restrict vy1 = vy + by;
| ~~ ^
/data/data/com.termux/files/home/llama2/ggml-quants.c:4072:46: warning: arithmetic on a pointer to void is a GNU extension [-Wgnu-pointer-arith]
4072 | const block_q4_1 * restrict vx1 = vx + bx;
| ~~ ^
/data/data/com.termux/files/home/llama2/ggml-quants.c:4074:46: warning: arithmetic on a pointer to void is a GNU extension [-Wgnu-pointer-arith]
4074 | const block_q8_1 * restrict vy1 = vy + by;
| ~~ ^
/data/data/com.termux/files/home/llama2/ggml-quants.c:4885:46: warning: arithmetic on a pointer to void is a GNU extension [-Wgnu-pointer-arith]
4885 | const block_q8_0 * restrict vx1 = vx + bx;
| ~~ ^
/data/data/com.termux/files/home/llama2/ggml-quants.c:4887:46: warning: arithmetic on a pointer to void is a GNU extension [-Wgnu-pointer-arith]
4887 | const block_q8_0 * restrict vy1 = vy + by;
| ~~ ^
6 warnings generated.
[ 27%] Built target ggml
[ 27%] Building CXX object CMakeFiles/llama.dir/llama.cpp.o
[ 36%] Building CXX object CMakeFiles/llama.dir/unicode.cpp.o
[ 36%] Building CXX object CMakeFiles/llama.dir/unicode-data.cpp.o
[ 45%] Linking CXX static library libllama.a
[ 45%] Built target llama
[ 54%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
/data/data/com.termux/files/home/llama2/common/common.cpp:776:9: error: use of undeclared identifier 'is_hybrid_cpu'
776 | if (is_hybrid_cpu()) {
| ^
/data/data/com.termux/files/home/llama2/common/common.cpp:778:14: error: use of undeclared identifier 'pthread_getaffinity_np'; did you mean 'sched_getaffinity'?
778 | if (!pthread_getaffinity_np(pthread_self(), sizeof(affinity), &affinity)) {
| ^~~~~~~~~~~~~~~~~~~~~~
| sched_getaffinity
/data/data/com.termux/files/usr/include/sched.h:251:5: note: 'sched_getaffinity' declared here
251 | int sched_getaffinity(pid_t __pid, size_t __set_size, cpu_set_t* _Nonnull __set);
| ^
/data/data/com.termux/files/home/llama2/common/common.cpp:779:26: error: use of undeclared identifier 'count_math_cpus'
779 | int result = count_math_cpus(cpu_count);
| ^
/data/data/com.termux/files/home/llama2/common/common.cpp:780:13: error: use of undeclared identifier 'pthread_setaffinity_np'; did you mean 'sched_setaffinity'?
780 | pthread_setaffinity_np(pthread_self(), sizeof(affinity), &affinity);
| ^~~~~~~~~~~~~~~~~~~~~~
| sched_setaffinity
/data/data/com.termux/files/usr/include/sched.h:243:5: note: 'sched_setaffinity' declared here
243 | int sched_setaffinity(pid_t __pid, size_t __set_size, const cpu_set_t* _Nonnull __set);
| ^
4 errors generated.

make[3]: *** [common/CMakeFiles/common.dir/build.make:76: common/CMakeFiles/common.dir/common.cpp.o] Error 1
make[2]: *** [CMakeFiles/Makefile2:1672: common/CMakeFiles/common.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:3465: examples/server/CMakeFiles/server.dir/rule] Error 2
make: *** [Makefile:1401: server] Error 2

mann1x (Author) commented Apr 29, 2024

On the other hand, #6832 is no longer building for me:

I have no idea why...

@ggerganov
There's a conflict with common.cpp but it doesn't tell me what it is...
I can't merge my fork anymore and it doesn't tell me why.
Sorry, but I have no idea how to fix this.
Maybe it's just something basic I'm missing; any tip?

Jeximo (Contributor) commented Apr 29, 2024

On the other hand, #6832 is no longer building for me:

I have no idea why...

It looks like is_hybrid_cpu's check uses pthread_getaffinity_np, but AFAIK Android (bionic) does not provide the pthread affinity functions; the compiler suggests sched_getaffinity/sched_setaffinity instead.
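
If the blocker is really the missing pthread affinity wrappers on bionic, a fallback along these lines might work (sketch only; on Linux a pid of 0 refers to the calling thread):

#define _GNU_SOURCE
#include <sched.h>

// Sketch: bionic has no pthread_getaffinity_np/pthread_setaffinity_np,
// but the Linux syscalls are there; pid 0 means the calling thread.
static int get_self_affinity(cpu_set_t * set) {
    return sched_getaffinity(0, sizeof(cpu_set_t), set);
}

static int set_self_affinity(const cpu_set_t * set) {
    return sched_setaffinity(0, sizeof(cpu_set_t), set);
}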
