Improving performance of broadcast_axis on GPU #18168
Conversation
Hey @access2rohit , Thanks for submitting the PR
CI supported jobs: [windows-gpu, edge, centos-cpu, unix-cpu, unix-gpu, windows-cpu, miscellaneous, website, sanity, centos-gpu, clang]
@mxnet-label-bot add [pr-work-in-progress]
const int32_t ndim) {
  int32_t idx = i;
  int32_t in_idx = i;
#pragma unroll 4
Did you benchmark if unroll makes a measurable difference? Background: currently libmxnet.so is quite large, and we may want to remove loop unrolling and force inline to reduce the size
The loop has only 4 iterations. @ptrendx suggested that when indexing an array inside a struct, the compiler doesn't know the index at compile time, so unrolling can give better performance.
@leezu To answer your question: yes, there is an improvement in performance when unrolling — slightly more with `unroll 4` than with plain `unroll`.
@access2rohit Could you provide data? How much performance improvement do you see with `unroll` and with `unroll 4`?
Also, is the index_t in this kernel still int64_t?
@apeforest
The performance gain is about 4% with unroll.
I wanted to keep the implementation consistent with the existing code, so I kept index_t. Since the target was to make sure LTS performance >= master non-LTS performance, the new approach still achieves that.
I could also hard-code every variable inside the GPU kernel to int32_t, which would give a further 10-15% gain for LTS MXNet.
It is confusing to mix index_t with int in this kernel, since your variable i in the Map function is int.
Aah, my bad. Will fix it.
Based on your PR description, the only difference between the CPU and GPU kernels is that the GPU kernel uses int32_t for indexing while the CPU kernel uses int64_t (when the large tensor compiler flag is on). Duplicating the kernel implementation into two structs seems like overkill to me. Could we optimize the code so that they still share the same kernel template, just with a different index_t type? It would also be nice to re-architect the code so that it applies to all operators that share the same template between CPU and GPU kernels.
Hence, the two structs are not overkill but necessary to limit the impact; otherwise the changes would require performing stride calculations inside the implementations of roughly 15 operators, significantly increasing the scope of the change. Also, a typedef is resolved at compile time and cannot be decided dynamically. I have updated the description now; it wasn't consistent while the PR was WIP, and it is now ready for review!
@mxnet-bot run ci [macosx-x86_64, edge, miscellaneous, clang]
Jenkins CI successfully triggered : [unix-cpu, clang, unix-gpu, miscellaneous, edge]
Jenkins CI successfully triggered : [edge, clang, miscellaneous]
@mxnet-label-bot update [pr-awaiting-review]
@ptrendx can you review this PR ?
I did not mean
@apeforest : kernels can be fused together using templates, but that approach hurts CPU performance. There is another PR that actually boosts CPU performance, #17882, since there I am leveraging vectorization.
for (index_t iter = ndim - 1; iter >= 0; --iter) {
  index_t out_dim_shape = aux_data.output_shape[iter];
  index_t out_dim_stride = aux_data.out_stride[iter];
  index_t dim_idx = idx - (idx / out_dim_shape) * out_dim_shape;
Does this really improve performance compared to using `%`?
Very little, not more than 2%. But if you look at nvprof's "ALU instructions issued" counter, the number of instructions is reduced.
Kernel<broadcast_kernel<mshadow_op::identity>, xpu>::Launch(
    s, out.shape_.Size(), data.dptr_, out.dptr_, in_shape, out_shape, req[0], 2);
if (ctx.run_ctx.get_ctx().dev_type == Context::kGPU) {
  Kernel<broadcast_kernel_gpu<mshadow_op::identity>, xpu>::Launch(
Maybe I missed something, but the GPU and CPU kernels look highly similar. Adding a branch here makes the code very fragmented, since this entire function template<typename xpu> inline void BroadcastComputeImpl is supposed to be a generic template for both CPU and GPU.
I think we need a better way to code this. Two approaches I can think of:
(1) consolidate the CPU and GPU kernels;
(2) instead of a generic template, break it into two different functions.
(1) is preferred in my opinion.
@@ -128,7 +128,7 @@ NNVM_REGISTER_OP(_broadcast_backward)
.set_attr<FResourceRequest>("FResourceRequest",
  [](const NodeAttrs& attrs) {
    return std::vector<ResourceRequest>{ResourceRequest::kTempSpace};
  });
Why remove the indent here?
The indent wasn't supposed to be there; it must align with NNVM. For example: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/matrix_op.cc#L446-L491
Why must it align with NNVM? There are no open braces for NNVM; this closing brace is matched to the set_attr function.
Please see https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/matrix_op.cc#L751
Saw other examples in the codebase. Makes sense!
Jenkins CI successfully triggered : [unix-gpu, sanity, clang, edge, website, centos-gpu, centos-cpu, miscellaneous, windows-gpu, unix-cpu, windows-cpu]
Looks much nicer. Only a few comments about styling. Otherwise good to go. Thanks for your effort.
@@ -1049,29 +1049,55 @@ void ReduceAxesBackwardUseInOut(const nnvm::NodeAttrs& attrs,
  ReduceAxesBackwardUseInOutImpl<xpu, OP, normalize>(ctx, small, inputs, req, outputs);
}

namespace {  // unnamed namespace to keep scope of the struct within the file
struct shape_and_stride {
Please use CamelCase naming convention for struct. See this: https://google.github.io/styleguide/cppguide.html#Type_Names
Will do. But it's confusing that when a struct is used as a kernel it uses snake_case, while a plain struct is CamelCased in our code.
I agree with @apeforest in terms of the CamelCase standard.
if (in_shape[iter] != 1) {
  in_idx += dim_idx * in_stride;
#pragma unroll 4
for (index_t iter = ndim - 1; iter >= 0; --iter) {
iter is int type
for (index_t iter = ndim - 1; iter >= 0; --iter) {
  index_t out_dim_shape = aux_data.output_shape[iter];
  index_t out_dim_stride = aux_data.out_stride[iter];
  index_t dim_idx = idx - (idx / out_dim_shape) * out_dim_shape;
I think compiler can optimize % to this. Let's keep modulo here for better readability.
The compiler doesn't do that:
- nvcc is not intelligent enough to do this;
- gcc doesn't need to, since this doesn't slow down CPU performance in any measurable way.
I have added a comment to explain that this is a modulo operation.
aux_data->output_shape[iter] = out_shape[iter];
iter--;
for (; iter >= 0; --iter) {
  aux_data->out_stride[iter] = aux_data->out_stride[iter+1] * out_shape[iter+1];
nit: add spaces around operators like +
With the latest change, could you please re-run opperf and paste the latest performance results?
@ptrendx Your review will be appreciated.
@apeforest Actually, there isn't any change in the results after re-running opperf. I kept updating the description whenever the results differed, and currently they are consistent with the ones in the description. I will paste the BERT results in a few minutes.
@eric-haibin-lin @sxjscience @apeforest @szhengac BERT (training run, single GPU on p3.16xl). Configurations measured:
- new no-LT
- new LT
- master no-LT
- master LT
(result tables not preserved in this export)
LGTM
* adding separate int32_t kernel for GPU in broadcast_axis/to/like operators
* using structure instead of temp workspace to pass stride and shape
* replacing hardcoded int32_t with generic index_t
* combining CPU and GPU kernels to leverage cached stride calculation and fast access shape data in both

Co-authored-by: Rohit Kumar Srivastava <srivastava.141@buckeyemail.osu.edu>
* Improving performance of broadcast_axis on GPU (#18168)
  * adding separate int32_t kernel for GPU in broadcast_axis/to/like operators
  * using structure instead of temp workspace to pass stride and shape
  * replacing hardcoded int32_t with generic index_t
  * combining CPU and GPU kernels to leverage cached stride calculation and fast access shape data in both
* Improve performance of broadcast_axis on CPU (#17882)
  * adding comments explaining code optimizations
  * fixing broadcast_axis kernel to int32
  * fixing slice_axis kernel to int32
  * combining CPU and GPU implementation method signatures and cleaned up code
  * adding new broadcast_axis to np_matmul

Co-authored-by: Rohit Kumar Srivastava <srivastava.141@buckeyemail.osu.edu>
Description
Improving GPU kernel performance by caching stride calculations and shape data.
This change is needed to maintain BERT performance when large tensor support is enabled by default.
CPU performance remains identical to or better than master-lt and master, respectively (no changes there).
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Testing
Results
Profiling Code: