
Implementation of CPU multi-threading #80

Merged
41 commits merged into dev from cpuprange on Jul 22, 2017

Conversation

@MKirchen (Contributor) commented Jul 14, 2017

  • Reworked particle methods (to not use weights function anymore)
  • Introduced prange (parallel execution on multiple threads) for most important particle methods (push, deposition, gathering)
  • Refactored "particles" folder and code. New cleaner structure.
  • Removed the "linear_non_atomic" GPU deposition method. Although it might be faster in some cases, we no longer use it by default. Moreover, the atomic_add version is much cleaner, so we should stick to that one alone. (Also, in a test on the Tesla P100, both deposition schemes ran at the same speed.)

ToDo:

  • Adapt imports in tests so that they no longer fail
  • Implement cubic shape order deposition methods (just copy/paste and adapt from the GPU version). Needed to fix the cubic gathering methods by using while index_z < 4 instead of a for loop; we should ask the numba developers, as this looks like a bug.
  • Implement prange for the field methods (use prange rather than vectorize, as I think prange is faster; use prange over both axes). Question to @RemiLehe: should we keep the old methods and adapt them to the structure of the particles folder, or use prange as the default?
  • Maybe: write our own reduction function to replace np.sum() in deposit, or wait until numba's prange supports the axis argument of np.sum() (this is planned to be implemented). Merged, but: is it possible to also prange over Nr?
  • Maybe: move the allocation of the global deposition arrays to the init function, so that we don't have to reallocate each time deposit is called (we should have enough memory available on the CPU, so this might be worthwhile). We would need to pass Nz and Nr to the Particles object.
  • @RemiLehe Is there threading with blas() for the CPU? (Or do we just use array multiplication with numba? I forgot how it's implemented for the CPU.)
  • Nice to have: adapt the GPU and CPU gather function/kernel to match the structure of the cubic gathering function (we might get the same performance with cleaner code). Moreover, check whether lists are by now supported in CUDA kernels; then we could get rid of cuda.local.array(). Or maybe we write iz/ir/Sz/Sr directly to the registers.
  • Check availability of prange: similar to cuda_installed, we should check whether prange is available.
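The core pattern described above (splitting the particles into per-thread chunks, depositing into thread-local arrays, then reducing) can be sketched as follows. This is a minimal stand-in with illustrative names and shapes, not the actual FBPIC code; in the real implementation the outer loop over threads is a numba prange, here it is kept as a plain serial loop so the sketch runs without numba.

```python
import numpy as np

def deposit_rho_threaded(z_cells, weights, Nz, nthreads):
    """Sketch: chunked deposition into thread-local arrays + reduction.

    In the jitted version, the loop over `tx` would be a numba prange;
    a serial loop is used here so the sketch stays runnable as-is.
    """
    Ntot = len(z_cells)
    # Even chunk per thread; the last thread absorbs the remainder
    tx_N = Ntot // nthreads
    tx_chunks = [tx_N] * nthreads
    tx_chunks[-1] += Ntot % nthreads

    # One local deposition array per thread avoids write conflicts
    rho_local = np.zeros((nthreads, Nz))
    for tx in range(nthreads):  # prange(nthreads) in the jitted version
        start = tx * tx_N
        for ip in range(start, start + tx_chunks[tx]):
            rho_local[tx, z_cells[ip]] += weights[ip]

    # Reduction: sum the thread-local arrays into the global array
    return rho_local.sum(axis=0)

rho = deposit_rho_threaded(np.array([0, 1, 1, 2, 2]), np.ones(5),
                           Nz=4, nthreads=2)
```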

rho_m0_thread[iz_cell+1 + shift_z, ir_cell+1 + shift_r] += R_m0_11
rho_m1_thread[iz_cell+1 + shift_z, ir_cell+1 + shift_r] += R_m1_11

# Write thread local deposition arrays to global deposition arrays
Member

Just a random thought: wouldn't the operation below be faster if we initialized rho_m0_global with a different shape (i.e. such that we can do rho_m0_global[tx,:,:] = rho_m0_thread, so that the copy is contiguous in memory)?

I agree that the reduction operation is then maybe not as fast, but we could also write a numba reduction function that uses prange (with parallelization across the z index, for instance).
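A sketch of this suggestion, assuming rho_m0_global is allocated as (nthreads, Nz, Nr) so that each per-thread copy is one contiguous slice, followed by a reduction over the thread axis. All sizes are illustrative, and the serial iz loop stands in for the prange over z mentioned above.

```python
import numpy as np

nthreads, Nz, Nr = 4, 8, 6  # illustrative sizes

# Global array holds one full deposition grid per thread:
# writing rho_m0_global[tx, :, :] is then contiguous in memory
rho_m0_global = np.zeros((nthreads, Nz, Nr))

for tx in range(nthreads):
    rho_m0_thread = np.full((Nz, Nr), tx + 1.0)  # stand-in for a real deposit
    rho_m0_global[tx, :, :] = rho_m0_thread

# Reduction across threads; in a jitted version the iz loop would be
# a prange, parallelizing the reduction across the z index
rho_m0 = np.zeros((Nz, Nr))
for iz in range(Nz):  # prange(Nz) in the jitted version
    for tx in range(nthreads):
        rho_m0[iz, :] += rho_m0_global[tx, iz, :]
```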

@RemiLehe (Member)

Regarding the threading for the BLAS operation on CPU: it is already used by default in the current dev branch. I think it is controlled by the environment variable MKL_NUM_THREADS.
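For reference, a minimal way to pin the BLAS thread count from Python. This assumes an MKL-backed numpy build; the variable must be set before numpy is first imported to take effect, and OpenBLAS builds read a different variable instead.

```python
import os

# Must be set before `import numpy` for MKL to pick it up
os.environ["MKL_NUM_THREADS"] = "4"
# OpenBLAS-backed builds read this variable instead
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")

import numpy as np  # the BLAS thread pool is sized on first use
```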

@@ -764,7 +764,7 @@ def deposit( self, fld, fieldtype ) :
     # Register particle chunk size for each thread
     tx_N = int(self.Ntot/self.nthreads)
     tx_chunks = [ tx_N for k in range(self.nthreads) ]
-    tx_chunks[-1] = tx_chunks[-1] + (tx_N)%(self.nthreads)
+    tx_chunks[-1] = tx_chunks[-1] + self.Ntot%(self.nthreads)
@MKirchen (Contributor, Author)

good catch! thx

@RemiLehe RemiLehe mentioned this pull request Jul 17, 2017
# Register particle chunk size for each thread
tx_N = int(self.Ntot/self.nthreads)
tx_chunks = [ tx_N for k in range(self.nthreads) ]
tx_chunks[-1] = tx_chunks[-1] + int(self.Ntot%self.nthreads)
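The difference matters whenever Ntot is not a multiple of nthreads: the old line added tx_N % nthreads (often zero) instead of the actual leftover, so the last particles could be silently dropped. A small check of the corrected formula, with illustrative numbers:

```python
def chunk_sizes(Ntot, nthreads):
    # Corrected version: the last thread absorbs Ntot % nthreads leftovers
    tx_N = int(Ntot / nthreads)
    tx_chunks = [tx_N for k in range(nthreads)]
    tx_chunks[-1] = tx_chunks[-1] + int(Ntot % nthreads)
    return tx_chunks

# Old (buggy) remainder: tx_N % nthreads = 3 % 3 = 0 -> only 9 of 10 particles
# New remainder: Ntot % nthreads = 10 % 3 = 1 -> all 10 particles covered
assert sum(chunk_sizes(10, 3)) == 10
```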
Member

@MKirchen Would you be okay with using an array for tx_chunks instead of a list?
I am worried that a list may be cumbersome for numba, since numba then has to check the type of every single element of the list each time it calls a jitted function (as opposed to just once for an array).
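A sketch of the suggested change, replacing the Python list with a typed numpy array (sizes are illustrative); numba then checks the dtype once per call instead of reflecting every element of the list:

```python
import numpy as np

Ntot, nthreads = 10, 3  # illustrative values

tx_N = Ntot // nthreads
# Typed array instead of a Python list of ints
tx_chunks = np.full(nthreads, tx_N, dtype=np.int64)
tx_chunks[-1] += Ntot % nthreads
```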

@RemiLehe RemiLehe changed the title [WIP] Implementation of CPU multi-threading Implementation of CPU multi-threading Jul 20, 2017
@RemiLehe (Member)

This is a fantastic pull request!! Thank you very much! 🎉 ✨ ✨

@RemiLehe RemiLehe merged commit 68d4554 into dev Jul 22, 2017
@RemiLehe RemiLehe deleted the cpuprange branch July 22, 2017 05:36