Make AccCpuThreads parallelize over blocks #1135

Open · jkelling opened this issue Aug 25, 2020 · 7 comments

@jkelling (Member)

AccCpuThreads is currently a bad showcase of C++11 threads, as it uses the sub-optimal strategy of spawning CPU threads at the thread level instead of the block level, just like the equally useless AccOmp2Threads.

To escape Amdahl's law for as long as possible, you must parallelize at as high a level as possible. AccCpuThreads should do this.

I am aware that AccCpuThreads is, and will remain, only a demonstration, so there will not be many resources to spend on improving it. I am just putting the issue here for reference, based on a discussion we had.

@BenjaminW3 (Member) commented Aug 25, 2020

Similar to AccOmp2Threads and AccOmp2Blocks, we could call the current back-end AccCpuStdThreadThreads and add a new one, AccCpuStdThreadBlocks.

However, it is not yet clear to me how this can improve things. Yes, we could simply execute multiple blocks in parallel. However, this restricts the block size to a single thread. This is necessary because threads within a block can be explicitly synchronized/interrupted, which is not possible by default with std::thread executing a function (in our case a kernel). We cannot simply execute one kernel invocation after the other for all threads in a block.
I do not see how this would be faster than the current version. We are already using a thread pool, so the threads are not recreated for each block.

What we would need is a mix of preemptive multitasking (std::thread) for the blocks and cooperative multitasking for the threads, so that we get the most benefit from caches because a single std::thread executes a whole block.

| Accelerator back-end | Execution strategy: grid-blocks | Execution strategy: block-threads |
| --- | --- | --- |
| AccCpuStdThreadThreads | sequential | parallel (preemptive multitasking) |
| AccCpuStdThreadBlocks | parallel (preemptive multitasking) | sequential (only 1 thread per block) |
| AccCpuFibers | sequential | parallel (cooperative multitasking) |
| Mixed | parallel (preemptive multitasking) | parallel (cooperative multitasking) |

I already proposed this in #22.

@jkelling (Member, Author) commented Aug 25, 2020

> However, it is not yet clear to me how this can improve things. Yes, we could simply execute multiple blocks in parallel. However, this restricts the block size to a single thread.

Yes, having only one thread per block, like AccOmp2Blocks, is the idea. Alpaka kernels are probably mostly written with GPUs (i.e. the CUDA back-end) in mind, which will often lead to one of the following two properties (both visible in the sketch after this list):

  1. small workload per thread
  2. synchronization between threads used for collective operations which cater to GPU memory architecture/SIMT
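
To make the two properties concrete, here is a hedged sketch (my illustration, not code from this thread) of a typical GPU-style kernel: a block-wise tree reduction written against the alpaka kernel interface, using the names of recent alpaka releases (older releases nest the same functions in sub-namespaces) and assuming a fixed block size of 256. Each thread loads a single value (property 1), and the collective relies on repeated block-wide barriers (property 2); on the current AccCpuThreads each of those barriers synchronizes up to 256 OS threads.

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

// Hedged sketch: GPU-style block reduction, assuming a block size of 256.
struct BlockSumKernel
{
    template<typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc, float const* in, float* out) const
    {
        // block-shared staging buffer for the partial sums
        auto& cache = alpaka::declareSharedVar<float[256], __COUNTER__>(acc);
        auto const tid = alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc)[0];

        // property 1: each thread touches a single element
        cache[tid] = in[alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0]];
        // property 2: block-wide barrier before the collective
        alpaka::syncBlockThreads(acc);

        // tree reduction over shared memory, one barrier per level
        for(std::size_t s = 128u; s > 0u; s /= 2u)
        {
            if(tid < s)
                cache[tid] += cache[tid + s];
            alpaka::syncBlockThreads(acc);
        }
        if(tid == 0u)
            out[alpaka::getIdx<alpaka::Grid, alpaka::Blocks>(acc)[0]] = cache[0];
    }
};
```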

> I do not see how this would be faster than the current version. We are already using a thread pool, so the threads are not recreated for each block.

The two cases above are currently slow because (see the scheduling sketch after this list):
1.
   * With a small workload, even the thread pool will produce noticeable overhead when distributing alpaka threads to the OS threads from the pool.
   * The parallel work in the threads is either scheduled sequentially by block (invoking Amdahl's law), or we are running so many threads that we get congestion issues on the CPU.
2. Collective operations implemented as optimizations for GPUs do not help on the CPU, while the synchronization they require is significantly more expensive on the CPU ("warp size" of one) than on the GPU (warp size of 32/64). The overhead would be removed by having only one thread per block.
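
A minimal sketch of the proposed block-level scheduling (plain C++17, my illustration; `runGridBlocksInParallel` is a hypothetical helper, and `std::async` merely stands in for alpaka's thread pool, which would additionally cap concurrency at the core count):

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hedged sketch: each *block* becomes one task for an OS thread; the
// block-thread dimension is fixed to 1, so no intra-block synchronization
// is ever needed and the sequential fraction stays small.
template<typename BlockKernel>
void runGridBlocksInParallel(BlockKernel kernel, std::size_t numBlocks)
{
    std::vector<std::future<void>> tasks;
    tasks.reserve(numBlocks);
    for(std::size_t b = 0u; b < numBlocks; ++b)
        tasks.push_back(std::async(std::launch::async, kernel, b)); // one task per block
    for(auto& t : tasks)
        t.get(); // the grid is done once every block has finished
}
```

With the block-thread dimension fixed to 1, the tasks never need to synchronize with each other, so a pool can keep every core busy until the whole grid is done.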

> What we would need is a mix of preemptive multitasking (std::thread) for the blocks and cooperative multitasking for the threads, so that we get the most benefit from caches.

With the type of GPU code I described above, one would ideally map blocks to OS threads and threads to SIMD lanes, because this would be closest to the GPU architectures. I know that this is not viable, because SIMD lacks some important bits of SIMT (e.g. __syncthreads() and scattered memory loads/stores). The best we can do is map OS threads to blocks, ignore threads, and leave SIMD to the element layer.

> I already proposed this in #22.

I do not know Boost.Fiber, but the issue could be the same.

@jkelling (Member, Author)

Btw., OS threads sharing data in a block are probably just as likely to slow each other down due to false sharing (causing each other's caches to be flushed) as they are to profit from sharing cache lines.

@BenjaminW3 (Member)

> Btw., OS threads sharing data in a block are probably just as likely to slow each other down due to false sharing (causing each other's caches to be flushed) as they are to profit from sharing cache lines.

Yes, this is probably one of the main issues with the current version of AccCpuThreads.
The exact opposite would be true for my Mixed proposal.

@jkelling (Member, Author)

> > Btw., OS threads sharing data in a block are probably just as likely to slow each other down due to false sharing (causing each other's caches to be flushed) as they are to profit from sharing cache lines.
>
> Yes, this is probably one of the main issues with the current version of AccCpuThreads.
> The exact opposite would be true for my Mixed proposal.

Are you suggesting to implement cooperative multitasking at thread level in alpaka by hand? Maybe using coroutines?

@BenjaminW3 (Member)

Boost.Fiber is a library with an interface similar to std::thread, but based on Boost.Context, which implements cooperative multitasking by hand.
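
For illustration, a hedged sketch of how one block of the Mixed scheme could look on top of Boost.Fiber (`runBlockAsFibers` is a hypothetical helper, not alpaka code): the block's threads become fibers that share one OS thread, and a fiber barrier plays the role of syncBlockThreads() without OS-level context switches.

```cpp
#include <boost/fiber/all.hpp>
#include <cstddef>
#include <vector>

// Hedged sketch: one OS thread executes a whole block; the block's threads
// are fibers scheduled cooperatively on that same OS thread.
void runBlockAsFibers(std::size_t blockSize)
{
    boost::fibers::barrier sync{blockSize}; // block-wide barrier between fibers
    std::vector<boost::fibers::fiber> blockThreads;
    blockThreads.reserve(blockSize);
    for(std::size_t t = 0u; t < blockSize; ++t)
        blockThreads.emplace_back([&sync, t] {
            // ... first part of the kernel for block-thread t ...
            sync.wait(); // cooperative __syncthreads() analogue
            // ... second part of the kernel for block-thread t ...
        });
    for(auto& f : blockThreads)
        f.join(); // drives the fibers on this OS thread until all finish
}
```

In the Mixed scheme, each call to such a function would itself be one task per block on the preemptive std::thread pool.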

@j-stephan (Member)

The Intel oneAPI compiler for CPUs solves the issue in the following way:

  1. Each SYCL work-group (= block) is mapped to a single CPU thread.
  2. SYCL work-items (= threads) are mapped to SIMD lanes.

So when the compiler encounters a work-group it will launch its auto-vectorizer. I wonder if we can do something similar in alpaka, maybe by making better use of the element layer.
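
A hedged sketch (my illustration) of what that mapping means for a kernel body: the whole work-group runs on one CPU thread, and the work-items/elements collapse into a contiguous inner loop that the auto-vectorizer can turn into SIMD instructions. `firstElem` and `elemExtent` are hypothetical stand-ins for the values a kernel would obtain from its work division.

```cpp
#include <cstddef>

// Hedged sketch: one CPU thread processes its whole slice of elements.
void saxpyBlock(float a, float const* x, float* y,
                std::size_t firstElem, std::size_t elemExtent)
{
    // unit-stride inner loop over the element layer: SIMD-friendly
    for(std::size_t e = firstElem; e < firstElem + elemExtent; ++e)
        y[e] = a * x[e] + y[e];
}
```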
