
Jialei gh297 speed up gpu computation on ei grad ei #351

Merged: 11 commits, Aug 8, 2014

Conversation

jialeiwang (Contributor)

********* PEOPLE *************
Primary reviewer: @suntzu86

Reviewers: @sc932

********* DESCRIPTION **************
Branch Name: jialei_gh297_speed_up_gpu_computation_on_EI_gradEI
Ticket(s)/Issue(s): Closes #297

********* TESTING DONE *************
make test
cpplint.py

@@ -134,6 +134,7 @@ __global__ void CudaComputeEIGpu(double const * __restrict__ mu, double const *
chunk_size = (num_union - 1)/ blockDim.x + 1;
CudaCopyElements(chunk_size * idx, chunk_size * (idx + 1), num_union, mu, mu_local);
__syncthreads();
double * normals = &mu_local[num_union];
Contributor:

__restrict__. Also, your other shared-memory double * pointers should be marked restrict too (and const if applicable).

Also, please edit the shared memory comments to describe where normals goes.

Contributor:

also, have you ever worked with "constant" memory? i'm not sure if that's appropriate for mu and chol_var: yes, they are constant, but GPU constant memory has some additional requirements on how a warp of threads accesses the data to get the best performance.

still, if it's usable, it could reduce some memory pressure.

Contributor:

one more: also indicate in the docs how the memory is laid out for random and how much memory is required (i.e., point out that it's sized [num_union][num_threads], so each thread has a block of num_union numbers).

you should mention this stuff in the function's docstring b/c callers have to know how much shared memory to specify when they launch the kernel.

Contributor:

two more:

  1. &mu_local[num_union] is the same as mu_local + num_union, and the latter is clearer imo.
  2. idx is fixed, right? Why not set normals = mu_local + num_union + idx * num_union? also, again, be very specific about the ordering of these matrices in shared memory.

Contributor Author:

fixed


suntzu86 commented Aug 5, 2014

woohoo speedups!
left you some mostly organizational and doc'ing comments

@jialeiwang (Contributor Author)

A few things to do next:

  1. check out "constant memory"
  2. reorder grad_chol_var_local to achieve more efficient reads, and optimize other similar cases in the same way

@@ -99,7 +99,7 @@ __forceinline__ __device__ void CudaCopyElements(int begin, int end, int bound,
}

/*!\rst
-Device code to compute Expected Improvement by Monte-Carlo on GPU
+GPU kernel function of computing Expected Improvement using Monte-Carlo.
\param
Contributor:

there needs to be a newline btwn the last text and param, e.g.,

blah blah blah

\param
  :foo: stuff
\output
  :bar: more stuff

o/w sphinx gets confused

Contributor Author:

fixed


suntzu86 commented Aug 8, 2014

  1. couple of docs-only changes. looking good!
  2. see my earlier comment about documenting max problem sizes before shared_mem runs out. (fixed, indicated in docs)
  3. you should ticket your two TODOs above (constant mem and reorganizing grad_chol_var)


* chol_var_local[num_union][num_union]: copy of chol_var in shared memory for each block
* mu_local[num_union]: copy of mu in shared memory for each block
* normals[num_union][num_threads]: shared memory for storage of normal random numbers for each block
Contributor:

oops, I goofed with the suggestion here. It should have been:

:chol_var_local[num_union][num_union]: blah blah
:mu_local[...]:
:etc:

that will format it like the parameter lists; there just isn't a \param shortcut to make a heading.

Contributor Author:

fixed


suntzu86 commented Aug 8, 2014

2 more docs-only changes.

Also, could you update CHANGELOG.md?

(num_union * num_union + num_union + num_union * num_threads)

doubles in total in shared memory. The order of the arrays placed in this shared memory is like
[chol_var_local, mu_local, normals]
Contributor:

let's put the mathish things and variable names in double backticks:

``(num_union * ...)``
``[chol_var_local, ...]``

etc

Contributor Author:

fixed


suntzu86 commented Aug 8, 2014

shipit

jialeiwang added a commit that referenced this pull request on Aug 8, 2014:
…on_on_EI_gradEI — Jialei gh297 speed up gpu computation on ei grad ei

jialeiwang merged commit 1eeb1a4 into master on Aug 8, 2014.
Successfully merging this pull request may close these issues.

Speed up GPU computation on EI & gradEI