
Implementation of CPU multi-threading #80

Merged
41 commits merged into dev from cpuprange on Jul 22, 2017

Conversation

@MKirchen (Contributor) commented Jul 14, 2017

  • Reworked particle methods (to not use weights function anymore)
  • Introduced prange (parallel execution on multiple threads) for most important particle methods (push, deposition, gathering)
  • Refactored "particles" folder and code. New cleaner structure.
  • Removed the "linear_non_atomic" GPU deposition method. Although it might be faster in some cases, we no longer use it by default. Moreover, the atomic_add version is much cleaner, so we should stick to that one alone. (Also, in a test on the Tesla P100, both deposition schemes ran at the same speed.)

ToDo:

  • Adapt imports in tests so that they no longer fail
  • Implement cubic shape order deposition methods (just copy/paste and adapt from the GPU version). Needed to fix the cubic gathering methods by using while index_z < 4 instead of a for loop; we should ask the numba developers, as this looks like a bug.
  • Implement prange for the field methods (use prange rather than vectorize, as I think prange is faster; use prange over both axes). Question to @RemiLehe: should we keep the old methods and adapt them to the structure of the particles folder, or use prange as the default?
  • Maybe: write our own reduction function to replace np.sum() in deposit, or wait until numba's prange supports the axis argument of np.sum() (this is planned to be implemented). Merged, but: is it possible to also prange over Nr?
  • Maybe: move the allocation of the global deposition arrays to the init function, so that we don't have to reallocate each time deposit is called (we should have enough memory available on the CPU, so this might be worthwhile). We would need to pass Nz and Nr to the Particles object.
  • @RemiLehe Is there threading with blas() for the CPU? (Or do we just use array multiplication with numba? I forgot how it's implemented for the CPU.)
  • Nice to have: adapt the GPU and CPU gather function/kernel to match the structure of the cubic gathering function (we might get the same performance with cleaner code). Moreover, check whether lists are by now supported in CUDA kernels; then we could get rid of cuda.local.array(). Or maybe we write iz/ir/Sz/Sr directly to the registers.
  • Check availability of prange: similar to cuda_installed, we should check whether prange is available.
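The core pattern described above (splitting the particles into per-thread chunks, depositing into thread-local arrays, then reducing) can be sketched as follows. This is a minimal stand-in with illustrative names and shapes, not the actual FBPIC code; in the real implementation the outer loop over threads is a numba prange, here it is kept as a plain serial loop so the sketch runs without numba.

```python
import numpy as np

def deposit_rho_threaded(z_cells, weights, Nz, nthreads):
    """Sketch: chunked deposition into thread-local arrays + reduction.

    In the jitted version, the loop over `tx` would be a numba prange;
    a serial loop is used here so the sketch stays runnable as-is.
    """
    Ntot = len(z_cells)
    # Even chunk per thread; the last thread absorbs the remainder
    tx_N = Ntot // nthreads
    tx_chunks = [tx_N] * nthreads
    tx_chunks[-1] += Ntot % nthreads

    # One local deposition array per thread avoids write conflicts
    rho_local = np.zeros((nthreads, Nz))
    for tx in range(nthreads):  # prange(nthreads) in the jitted version
        start = tx * tx_N
        for ip in range(start, start + tx_chunks[tx]):
            rho_local[tx, z_cells[ip]] += weights[ip]

    # Reduction: sum the thread-local arrays into the global array
    return rho_local.sum(axis=0)

rho = deposit_rho_threaded(np.array([0, 1, 1, 2, 2]), np.ones(5),
                           Nz=4, nthreads=2)
```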

rho_m0_thread[iz_cell+1 + shift_z, ir_cell+1 + shift_r] += R_m0_11
rho_m1_thread[iz_cell+1 + shift_z, ir_cell+1 + shift_r] += R_m1_11

# Write thread local deposition arrays to global deposition arrays
Member

Just a random thought: wouldn't the operation below be faster if we initialized rho_m0_global with a different shape (i.e. such that we can do rho_m0_global[tx,:,:] = rho_m0_thread, so that the copy is contiguous in memory)?

I agree that the reduction operation is then maybe not as fast, but we could also write a numba reduction function that uses prange (with parallelization across the z index, for instance).
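A sketch of this suggestion, assuming rho_m0_global is allocated as (nthreads, Nz, Nr) so that each per-thread copy is one contiguous slice, followed by a reduction over the thread axis. All sizes are illustrative, and the serial iz loop stands in for the prange over z mentioned above.

```python
import numpy as np

nthreads, Nz, Nr = 4, 8, 6  # illustrative sizes

# Global array holds one full deposition grid per thread:
# writing rho_m0_global[tx, :, :] is then contiguous in memory
rho_m0_global = np.zeros((nthreads, Nz, Nr))

for tx in range(nthreads):
    rho_m0_thread = np.full((Nz, Nr), tx + 1.0)  # stand-in for a real deposit
    rho_m0_global[tx, :, :] = rho_m0_thread

# Reduction across threads; in a jitted version the iz loop would be
# a prange, parallelizing the reduction across the z index
rho_m0 = np.zeros((Nz, Nr))
for iz in range(Nz):  # prange(Nz) in the jitted version
    for tx in range(nthreads):
        rho_m0[iz, :] += rho_m0_global[tx, iz, :]
```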

@RemiLehe (Member)

Regarding the threading for the BLAS operation on CPU: it is already used by default in the current dev branch. I think it is controlled by the environment variable MKL_NUM_THREADS.
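For reference, a minimal way to pin the BLAS thread count from Python. This assumes an MKL-backed numpy build; the variable must be set before numpy is first imported to take effect, and OpenBLAS builds read a different variable instead.

```python
import os

# Must be set before `import numpy` for MKL to pick it up
os.environ["MKL_NUM_THREADS"] = "4"
# OpenBLAS-backed builds read this variable instead
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")

import numpy as np  # the BLAS thread pool is sized on first use
```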

@@ -764,7 +764,7 @@ def deposit( self, fld, fieldtype ) :
     # Register particle chunk size for each thread
     tx_N = int(self.Ntot/self.nthreads)
     tx_chunks = [ tx_N for k in range(self.nthreads) ]
-    tx_chunks[-1] = tx_chunks[-1] + (tx_N)%(self.nthreads)
+    tx_chunks[-1] = tx_chunks[-1] + self.Ntot%(self.nthreads)
@MKirchen (Contributor, Author)

good catch! thx

@RemiLehe RemiLehe mentioned this pull request Jul 17, 2017
# Register particle chunk size for each thread
tx_N = int(self.Ntot/self.nthreads)
tx_chunks = [ tx_N for k in range(self.nthreads) ]
tx_chunks[-1] = tx_chunks[-1] + int(self.Ntot%self.nthreads)
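The difference matters whenever Ntot is not a multiple of nthreads: the old line added tx_N % nthreads (often zero) instead of the actual leftover, so the last particles could be silently dropped. A small check of the corrected formula, with illustrative numbers:

```python
def chunk_sizes(Ntot, nthreads):
    # Corrected version: the last thread absorbs Ntot % nthreads leftovers
    tx_N = int(Ntot / nthreads)
    tx_chunks = [tx_N for k in range(nthreads)]
    tx_chunks[-1] = tx_chunks[-1] + int(Ntot % nthreads)
    return tx_chunks

# Old (buggy) remainder: tx_N % nthreads = 3 % 3 = 0 -> only 9 of 10 particles
# New remainder: Ntot % nthreads = 10 % 3 = 1 -> all 10 particles covered
assert sum(chunk_sizes(10, 3)) == 10
```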
Member

@MKirchen Would you be okay with using an array for tx_chunks instead of a list?
I am worried that a list may be cumbersome for numba, since numba then has to check the type of every single element of the list each time it calls a jitted function (as opposed to just once for an array).
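A sketch of the suggested change, replacing the Python list with a typed numpy array (sizes are illustrative); numba then checks the dtype once per call instead of reflecting every element of the list:

```python
import numpy as np

Ntot, nthreads = 10, 3  # illustrative values

tx_N = Ntot // nthreads
# Typed array instead of a Python list of ints
tx_chunks = np.full(nthreads, tx_N, dtype=np.int64)
tx_chunks[-1] += Ntot % nthreads
```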

@RemiLehe RemiLehe changed the title [WIP] Implementation of CPU multi-threading Implementation of CPU multi-threading Jul 20, 2017
@RemiLehe (Member)

This is a fantastic pull request!! Thank you very much! 🎉 ✨ ✨

@RemiLehe RemiLehe merged commit 68d4554 into dev Jul 22, 2017
@RemiLehe RemiLehe deleted the cpuprange branch July 22, 2017 05:36