weird pin thread to cpu performance #12

Closed

troore opened this issue Jul 5, 2024 · 6 comments

Comments

@troore

troore commented Jul 5, 2024

Hi @clamchowder,

I want to pin threads to CPUs when measuring bandwidth, but there seems to be no such facility in non-NUMA mode, so I borrowed this part from CoherenceLatency:

void *ReadBandwidthTestThread(void *param) {
    BandwidthTestThreadData *bwTestData = (BandwidthTestThreadData *)param;
    if (hardaffinity) {
        // existing path: pin to the global mask shared by all threads
        sched_setaffinity(gettid(), sizeof(cpu_set_t), &global_cpuset);
    } else {
        // I add the following lines: pin this thread to its own logical CPU
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(bwTestData->processorIndex, &cpuset);
        sched_setaffinity(gettid(), sizeof(cpu_set_t), &cpuset);
        fprintf(stderr, "thread %ld set affinity %d\n", (long)gettid(), bwTestData->processorIndex);
    }
   ...
}

Besides, processorIndex is calculated as thread_idx % nprocs, following the processor-to-core-id mapping from /proc/cpuinfo.
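
For reference, a minimal sketch of that index computation (illustrative only, not the actual MemoryBandwidth code; ComputeProcessorIndex is a made-up helper name):

#include <unistd.h>

// Sketch only: wrap the worker thread index around the number of online
// logical CPUs, so thread 0 -> processor 0, thread 16 -> processor 0 again.
static int ComputeProcessorIndex(int thread_idx) {
    long nprocs = sysconf(_SC_NPROCESSORS_ONLN);   // 16 on this 5800X
    return (int)(thread_idx % nprocs);
}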

I tested on an AMD Ryzen 7 5800X, which has only one NUMA node (8 physical cores, 16 logical cores), so I didn't enable NUMA.

I got the following results:

[figure: measured bandwidth vs. test size, comparing "auto" and "manual" thread affinity; left panel: 8 threads, right panel: 16 threads]

In the figure above, "auto" means I ran the original MemoryBandwidth code, while "manual" means I added the CPU_SET and sched_setaffinity calls as in the code snippet above. The left and right figures show results for 8 and 16 threads, respectively.

My question is: why are the "manual" bandwidth results lower than the "auto" results with 8 threads, while "manual" catches up with 16 threads?

Thanks,
troore

@clamchowder
Owner

There's no facility to pin threads for memory bandwidth testing in non-NUMA mode because it is not needed. You can use other utilities to set affinity, like taskset on Linux or start /b /affinity <mask> on Windows to ensure the test only runs on certain physical cores.
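
For example, a rough Linux equivalent of running the benchmark under taskset -c 0-7 (a sketch, not project code; the helper name is made up) is to set the mask on the main thread before any workers are created, since new threads inherit it:

#define _GNU_SOURCE
#include <sched.h>

// Rough equivalent of `taskset -c 0-7`: restrict the calling thread to logical
// CPUs 0..7. Threads created afterwards inherit this mask, so calling this in
// main() before spawning workers confines the whole test.
static int RestrictToCpus0Through7(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++) CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(cpu_set_t), &set);   // pid 0 = calling thread
}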

@troore
Author

troore commented Jul 5, 2024

> There's no facility to pin threads for memory bandwidth testing in non-NUMA mode because it is not needed. You can use other utilities to set affinity, like taskset on Linux or start /b /affinity <mask> on Windows to ensure the test only runs on certain physical cores.

  1. Why is pinning threads not needed in non-NUMA mode?
  2. How can taskset guarantee precise affinity? For example, if we pin 2 threads to 2 physical cores (SMT2, 4 logical cores) with taskset, can we guarantee that the 2 threads are scheduled on physical cores 0 and 1, rather than both landing on physical core 0 and producing different L1/L2 bandwidth results? (A placement-check sketch follows below.)
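
For question 2, one way I can check placement empirically (sketch only; ReportPlacement is a hypothetical helper, not part of the project) is to have each worker print the logical CPU it is actually running on:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

// Sketch: report which logical CPU this worker is currently running on, to see
// whether two threads landed on separate physical cores or on SMT siblings of
// the same core. sched_getcpu() is Linux/glibc-specific and the value can
// change between samples if the thread is not pinned.
static void ReportPlacement(int thread_idx) {
    fprintf(stderr, "thread %d on logical CPU %d\n", thread_idx, sched_getcpu());
}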

@clamchowder
Owner

At one point I had an option to put the first thread on core 0, second thread on core 1, and so on, but found that it made no difference compared to setting affinity through taskset or start /b /affinity for the whole process. Operating systems today are SMT-aware and are good at preferring to load separate physical cores before loading SMT threads.

If you have a problem with the operating system not being SMT-aware, you can use taskset or start /b /affinity to exclude SMT sibling threads. I haven't seen it be a problem on any recent Windows or Linux install.

NUMA gets special handling because each thread allocates memory from a designated pool of memory, and has to be pinned to a core close to that pool.
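
For reference, that pattern looks roughly like this with libnuma (a sketch assuming libnuma is available and linked with -lnuma; AllocOnNodeAndPin is a made-up helper, not the project's actual code):

#include <stddef.h>
#include <numa.h>    // libnuma, link with -lnuma

// Sketch of the NUMA pattern: pin the calling thread to the CPUs of `node`,
// then allocate the test buffer from that node's memory pool.
static void *AllocOnNodeAndPin(size_t bytes, int node) {
    if (numa_available() < 0) return NULL;     // kernel/libnuma support missing
    numa_run_on_node(node);                    // run only on CPUs of this node
    return numa_alloc_onnode(bytes, node);     // memory backed by this node
    // release later with numa_free(ptr, bytes)
}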

@troore
Author

troore commented Jul 6, 2024

> At one point I had an option to put the first thread on core 0, second thread on core 1, and so on, but found that it made no difference compared to setting affinity through taskset or start /b /affinity for the whole process. Operating systems today are SMT-aware and are good at preferring to load separate physical cores before loading SMT threads.
>
> If you have a problem with the operating system not being SMT-aware, you can use taskset or start /b /affinity to exclude SMT sibling threads. I haven't seen it be a problem on any recent Windows or Linux install.
>
> NUMA gets special handling because each thread allocates memory from a designated pool of memory, and has to be pinned to a core close to that pool.

Makes sense, thanks. I think this issue is solved.

@troore troore closed this as completed Jul 6, 2024
@troore troore reopened this Jul 9, 2024
@troore
Author

troore commented Jul 9, 2024

Hi @clamchowder,

I've reopened this issue because I still can't explain the left figure of the original post (the comparison between auto and manual thread binding): I would expect 8 threads to be enough to fully utilize L1 bandwidth before the first slope.

I tried both taskset -c 0-7 and sched_setaffinity and got similar results. The affinity masks of the auto and manual runs are ffff and ff__, respectively. I cannot explain why the manual thread binding is lower than auto.
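
For completeness, this is roughly how I check the mask from inside the process (sketch only; PrintAffinityMask is a hypothetical helper, not part of the project):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

// Sketch: print the calling thread's affinity as a hex mask (low 64 CPUs only),
// to compare what the "auto" and "manual" runs actually end up with.
static void PrintAffinityMask(void) {
    cpu_set_t set;
    unsigned long long mask = 0;
    if (sched_getaffinity(0, sizeof(cpu_set_t), &set) == 0) {
        for (int cpu = 0; cpu < 64; cpu++)
            if (CPU_ISSET(cpu, &set)) mask |= 1ULL << cpu;
    }
    fprintf(stderr, "affinity mask: %llx\n", mask);
}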

Could you try to reproduce the results and help explain them?

Thanks,
troore

@clamchowder
Owner

Please don't do any affinity setting unless you're willing to investigate and debug the effects on your own time. If you choose to do that, tools like perf and performance counters can help you understand what's going on.

Affinity setting is not supported in general, and was only done to work around issues on certain platforms.
