weird pin thread to cpu performance #12
Comments
There's no facility to pin threads for memory bandwidth testing in non-NUMA mode because it is not needed. You can use other utilities to set affinity instead.

At one point I had an option to put the first thread on core 0, the second thread on core 1, and so on, but found that it made no difference compared to setting affinity through an external utility. If you have a problem with the operating system not being SMT-aware, you can work around that with such a utility as well. NUMA gets special handling because each thread allocates memory from a designated pool, and has to be pinned to a core close to that pool.
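To make that NUMA pattern concrete, here is a minimal sketch of "pin the thread, then allocate from the nearby pool", assuming libnuma is available; the function and names below are illustrative, not the project's actual code:

```c
#define _GNU_SOURCE
#include <numa.h>      // numa_alloc_onnode, numa_node_of_cpu (link with -lnuma)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity
#include <stdio.h>

// Pin the calling thread to `cpu`, then allocate its test buffer from the
// NUMA node that owns that CPU, so its memory accesses stay local.
void *numa_local_buffer(int cpu, size_t size) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // pid 0 = calling thread
        perror("sched_setaffinity");
        return NULL;
    }
    int node = numa_node_of_cpu(cpu);        // NUMA node closest to this CPU
    return numa_alloc_onnode(size, node);    // allocate from that node's pool
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }
    void *buf = numa_local_buffer(0, 1 << 20);  // 1 MiB local to CPU 0
    if (buf) numa_free(buf, 1 << 20);
    return 0;
}
```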
Makes sense, thanks. I think this issue is solved.
Hi @clamchowder, I've just reopened this issue because I am still unable to explain the left figure of the original post (the comparison between "auto" and "manual" with 8 threads). I tried both approaches again. Could you repeat the results and try to help explain? Thanks,
Please don't do any affinity setting unless you're willing to investigate and debug the effects on your own time; if you choose to do that, external tools are available for it. Affinity setting is not supported in general, and was only done to work around issues on certain platforms.
Hi @clamchowder,
I want to pin threads to CPUs when measuring bandwidth, but I found that there seems to be no such facility in non-NUMA mode. So I borrowed this part from CoherenceLatency:
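A minimal sketch of that borrowed pinning pattern, reconstructed from the `CPU_SET` and `sched_setaffinity` calls described below (variable names are assumed):

```c
#define _GNU_SOURCE
#include <sched.h>   // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity
#include <stdio.h>

// Inside each benchmark thread: restrict the thread to one logical CPU.
// processorIndex = thread_idx % nprocs, as described below.
static void pin_to_cpu(int processorIndex) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(processorIndex, &cpuset);
    // pid 0 means the calling thread
    if (sched_setaffinity(0, sizeof(cpu_set_t), &cpuset) != 0) {
        perror("sched_setaffinity");
    }
}
```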
Besides, the `processorIndex` is calculated by `thread_idx % nprocs`, according to the processor-to-core-id mapping from `/proc/cpuinfo`. I tested on an AMD Ryzen 7 5800X CPU, which has only one NUMA node (8 physical cores, 16 logical cores), so I didn't enable NUMA.
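For reference, a small standalone sketch of how that processor-to-core-id mapping can be read from `/proc/cpuinfo` (an illustration, not the benchmark's actual parsing code):

```c
#include <stdio.h>

// Print each logical processor and the physical core it belongs to.
// On an 8-core/16-thread part like the 5800X, two logical processors
// report the same "core id" when SMT is enabled.
int main(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    int processor = -1, core_id = -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "processor : %d", &processor) == 1) continue;
        if (sscanf(line, "core id : %d", &core_id) == 1)
            printf("processor %2d -> core id %d\n", processor, core_id);
    }
    fclose(f);
    return 0;
}
```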
I got the following results:
In the figure above, "auto" means I ran the original MemoryBandwidth code, while "manual" means I added the `CPU_SET` and `sched_setaffinity` calls as the code snippet shows. The left and right figures show the 8-thread and 16-thread results, respectively. My question is: why are the "manual" bandwidth results lower than the "auto" ones with 8 threads, while "manual" catches up with 16 threads?
Thanks,
troore