-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Jialei gh297 speed up gpu computation on ei grad ei #351
Jialei gh297 speed up gpu computation on ei grad ei #351
Conversation
…_up_gpu_computation_on_EI_gradEI
…_up_gpu_computation_on_EI_gradEI Conflicts: moe/optimal_learning/cpp/gpu/gpp_cuda_math.cu
@@ -134,6 +134,7 @@ __global__ void CudaComputeEIGpu(double const * __restrict__ mu, double const * | |||
chunk_size = (num_union - 1)/ blockDim.x + 1; | |||
CudaCopyElements(chunk_size * idx, chunk_size * (idx + 1), num_union, mu, mu_local); | |||
__syncthreads(); | |||
double * normals = &mu_local[num_union]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__restrict__
. Also, your other shared_mem double *
pointers should be marked restrict too (and const if applicable)
also, please edit the shared memory comments to describe where normals goes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also have you ever worked with "constant" memory? i'm not sure if that's appropriate for mu and chol_var--yes they are constant, but gpu constant memory has some additional requirements on how a warp of threads accesses the data to get the best performance.
still if it's usable that could reduce some memory pressure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one more: also indicate in docs how the memory is laid out for random and how much memory is required. (i.e., point out that it's sized [num_union][num_threads] so each thread as a block of num_union numbers)
you should mention this stuff in the function's docstring b/c callers have to know how much shared memory to specify when they launch the kernel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two more:
&mu_local[num_union]
is the same asmu_local + num_union
and the latter is more clear imo- idx is fixed right? Why not set
normals = mu_local + num_union + idx * num_union
? also again be very specific about the ordering of these matrices in shared memory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
woohoo speedups! |
A few things to do next:
|
…_up_gpu_computation_on_EI_gradEI
@@ -99,7 +99,7 @@ __forceinline__ __device__ void CudaCopyElements(int begin, int end, int bound, | |||
} | |||
|
|||
/*!\rst | |||
Device code to compute Expected Improvement by Monte-Carlo on GPU | |||
GPU kernel function of computing Expected Improvement using Monte-Carlo. | |||
\param |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there needs to be a newline btwn the last text and param, e.g.,
blah blah blah
\param
:foo: stuff
\output
:bar: more stuff
o/w sphinx gets confused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
|
||
* chol_var_local[num_union][num_union]: copy of chol_var in shared memory for each block | ||
* mu_local[num_union]: copy of mu in shared memory for each block | ||
* normals[num_union][num_threads]: shared memory for storage of normal random numbers for each block |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, I goofed with the suggestion here. It should have been:
:chol_var_local[num_union][num_union]: blah blah
:mu_local[...]:
:etc:
that will format it like the parameter lists; there just isn't a \param
shortcut to make a heading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
2 more docs-only changes. Also, could you update CHANGELOG.md? |
(num_union * num_union + num_union + num_union * num_threads) | ||
|
||
doubles in total in shared memory. The order of the arrays placed in this shared memory is like | ||
[chol_var_local, mu_local, normals] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's put the mathish things and variable names in double backticks:
``(num_union * ...)``
``[chol_var_local, ...]``
etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
shipit |
…on_on_EI_gradEI Jialei gh297 speed up gpu computation on ei grad ei
********* PEOPLE *************
Primary reviewer: @suntzu86
Reviewers: @sc932
********* DESCRIPTION **************
Branch Name: jialei_gh297_speed_up_gpu_computation_on_EI_gradEI
Ticket(s)/Issue(s): Closes #297
********* TESTING DONE *************
make test
cpplint.py