Implementation of CPU multi-threading #80
```python
rho_m0_thread[iz_cell+1 + shift_z, ir_cell+1 + shift_r] += R_m0_11
rho_m1_thread[iz_cell+1 + shift_z, ir_cell+1 + shift_r] += R_m1_11

# Write thread local deposition arrays to global deposition arrays
```
Just a random thought: wouldn't the operation below be faster if we initialized rho_m0_global with a different shape (i.e. such that we do rho_m0_global[tx,:,:] = rho_m0_thread), so that the copying is contiguous in memory?
I agree that the reduction operation is then maybe not as fast, but we could also write a numba reduction function that uses prange (with parallelization across the z index, for instance).
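A minimal sketch of what such a prange-based reduction could look like; the function name, the (nthreads, Nz, Nr) layout, and the preallocated output are assumptions for illustration, not the code merged in this PR:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def sum_reduce_2d_array(global_array, reduced):
    """Sum the per-thread copies along axis 0, parallelizing over z.

    global_array: shape (nthreads, Nz, Nr) -- one 2D grid per thread
    reduced:      shape (Nz, Nr) -- preallocated output, overwritten here
    """
    nthreads, Nz, Nr = global_array.shape
    # Each iz row is written by exactly one thread, so there is no race
    for iz in prange(Nz):
        for ir in range(Nr):
            reduced[iz, ir] = 0.
            for tx in range(nthreads):
                reduced[iz, ir] += global_array[tx, iz, ir]
```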
Allow better scaling of the deposition with number of threads
Regarding the threading for the BLAS operation on CPU: it is already used by default in the current …
fbpic/particles/particles.py (Outdated)
```diff
@@ -764,7 +764,7 @@ def deposit( self, fld, fieldtype ) :
     # Register particle chunk size for each thread
     tx_N = int(self.Ntot/self.nthreads)
     tx_chunks = [ tx_N for k in range(self.nthreads) ]
-    tx_chunks[-1] = tx_chunks[-1] + (tx_N)%(self.nthreads)
+    tx_chunks[-1] = tx_chunks[-1] + self.Ntot%(self.nthreads)
```
good catch! thx
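For context, a quick check with made-up numbers (Ntot = 11 and nthreads = 4 are hypothetical) shows why the old remainder was wrong:

```python
Ntot, nthreads = 11, 4          # hypothetical values for illustration
tx_N = int(Ntot / nthreads)     # 2 particles per thread
# old: last chunk = tx_N + tx_N % nthreads = 2 + 2 = 4 -> chunks sum to 10, one particle dropped
# new: last chunk = tx_N + Ntot % nthreads = 2 + 3 = 5 -> chunks sum to 11 == Ntot
```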
Implement parallel reduce
fbpic/particles/particles.py (Outdated)
```python
# Register particle chunk size for each thread
tx_N = int(self.Ntot/self.nthreads)
tx_chunks = [ tx_N for k in range(self.nthreads) ]
tx_chunks[-1] = tx_chunks[-1] + int(self.Ntot%self.nthreads)
```
@MKirchen Would you be okay with using an array for tx_chunks instead of a list?
I am worried that a list may be cumbersome for numba, since numba then has to check the type of every single element of the list each time it calls a jitted function (as opposed to just once for an array).
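A sketch of what the array-based version could look like; the standalone names Ntot and nthreads stand in for self.Ntot and self.nthreads, and this is an illustration rather than the PR's final code:

```python
import numpy as np

Ntot, nthreads = 11, 4   # hypothetical stand-ins for self.Ntot, self.nthreads
tx_chunks = np.full(nthreads, Ntot // nthreads, dtype=np.int64)
tx_chunks[-1] += Ntot % nthreads
# With a fixed-dtype array, numba can type-check the argument once,
# instead of inspecting every element of a reflected list on each call.
```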
Threaded grids, and avoiding duplication
Thread shift window
This is a fantastic pull request!! Thank you very much! 🎉 ✨ ✨
ToDo: