Improving performance of broadcast_axis on GPU #18168
Conversation
Hey @access2rohit , Thanks for submitting the PR
CI supported jobs: [windows-gpu, edge, centos-cpu, unix-cpu, unix-gpu, windows-cpu, miscellaneous, website, sanity, centos-gpu, clang]
@mxnet-label-bot add [pr-work-in-progress]
const int32_t ndim) {
  int32_t idx = i;
  int32_t in_idx = i;
#pragma unroll 4
Did you benchmark if unroll makes a measurable difference? Background: currently libmxnet.so is quite large, and we may want to remove loop unrolling and force inline to reduce the size
The loop has only 4 iterations. @ptrendx suggested that when indexing an array inside a struct, the compiler doesn't know the index at compile time, so unrolling can give better performance.
@leezu To answer your question: yes, there is an improvement in performance when unrolling — slightly more with `unroll 4` than with plain `unroll`.
@access2rohit Could you provide data? How much performance improvement do you see with `unroll` and with `unroll 4`?
Also, is the index_t in this kernel still int64_t?
@apeforest
The performance gain is about 4% with unroll.
I wanted to keep the implementation consistent with the existing code, so I kept index_t. Since the target was to make sure LTS performance >= master non-LTS performance, the new approach still achieves that.
I could also hard-code every variable inside the GPU kernel to int32_t, which would give a further 10-15% gain for LTS MXNet.
It is confusing to mix index_t with int in this kernel, since your variable i in the Map function is int.
Aah, my bad. Will fix it.
Based on your PR description, the only difference between the CPU and GPU kernels is that the GPU kernel uses int32_t for indexing while the CPU kernel uses int64_t (when the large tensor compiler flag is on). Duplicating the kernel implementation into two structs seems like overkill to me. Could we optimize the code so that they still share the same kernel template, just with a different index_t type? It would also be nice to re-architect the code so that it applies to all operators that share the same template between CPU and GPU kernels.
Hence, the two structs are not overkill but necessary to limit the impact; otherwise the changes would require performing stride calculations inside the implementations of roughly 15 operators, significantly increasing the scope of the change. Also, a typedef is resolved at compile time and cannot be decided dynamically. I have updated the description now; it wasn't consistent while the PR was WIP, and it is now ready for review!
@mxnet-bot run ci [macosx-x86_64, edge, miscellaneous, clang]
Jenkins CI successfully triggered : [unix-cpu, clang, unix-gpu, miscellaneous, edge]
Jenkins CI successfully triggered : [edge, clang, miscellaneous]
@mxnet-label-bot update [pr-awaiting-review]
@ptrendx can you review this PR ?
I did not mean
@apeforest : kernels can be fused together using templates, but that approach hurts CPU performance. There is another PR that actually boosts CPU performance, #17882, since there I am leveraging vectorization.
for (index_t iter = ndim - 1; iter >= 0; --iter) {
  index_t out_dim_shape = aux_data.output_shape[iter];
  index_t out_dim_stride = aux_data.out_stride[iter];
  index_t dim_idx = idx - (idx / out_dim_shape) * out_dim_shape;
Does this really improve performance compared to using `%`?
Very little, not more than 2%. But if you look at nvprof's "ALU instructions issued" counter, the number of instructions is reduced.
Kernel<broadcast_kernel<mshadow_op::identity>, xpu>::Launch(
    s, out.shape_.Size(), data.dptr_, out.dptr_, in_shape, out_shape, req[0], 2);
if (ctx.run_ctx.get_ctx().dev_type == Context::kGPU) {
  Kernel<broadcast_kernel_gpu<mshadow_op::identity>, xpu>::Launch(
Maybe I missed something, but the GPU and CPU kernels look highly similar. Adding a branch here makes the code very fragmented, since this entire function template<typename xpu> inline void BroadcastComputeImpl is supposed to be a generic template for both CPU and GPU.
I think we need a better way to code this. Two approaches I can think of:
(1) consolidate the CPU and GPU kernels;
(2) instead of a generic template, break it into two different functions.
(1) is preferred in my opinion.
@@ -128,7 +128,7 @@ NNVM_REGISTER_OP(_broadcast_backward)
.set_attr<FResourceRequest>("FResourceRequest",
  [](const NodeAttrs& attrs) {
    return std::vector<ResourceRequest>{ResourceRequest::kTempSpace};
  });
Why remove the indent here?
The indent wasn't supposed to be there; it must align with NNVM. For example: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/matrix_op.cc#L446-L491
Why must it align with NNVM? There are no open braces for NNVM; this closing brace is matched to the set_attr function.
Please see https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/matrix_op.cc#L751
Saw other examples in the codebase. Makes sense!
Jenkins CI successfully triggered : [unix-gpu, sanity, clang, edge, website, centos-gpu, centos-cpu, miscellaneous, windows-gpu, unix-cpu, windows-cpu]
Looks much nicer. Only a few comments about styling. Otherwise good to go. Thanks for your effort.
@@ -1049,29 +1049,55 @@ void ReduceAxesBackwardUseInOut(const nnvm::NodeAttrs& attrs,
  ReduceAxesBackwardUseInOutImpl<xpu, OP, normalize>(ctx, small, inputs, req, outputs);
}

namespace {  // unnamed namespace to keep scope of the struct within the file
struct shape_and_stride {
Please use CamelCase naming convention for struct. See this: https://google.github.io/styleguide/cppguide.html#Type_Names
Will do. But it's confusing that when a struct is used as a kernel it uses snake_case, while a plain struct is CamelCased in our code.
I agree with @apeforest in terms of the CamelCase standard.
if (in_shape[iter] != 1) {
  in_idx += dim_idx * in_stride;
#pragma unroll 4
for (index_t iter = ndim - 1; iter >= 0; --iter) {
iter is int type
for (index_t iter = ndim - 1; iter >= 0; --iter) {
  index_t out_dim_shape = aux_data.output_shape[iter];
  index_t out_dim_stride = aux_data.out_stride[iter];
  index_t dim_idx = idx - (idx / out_dim_shape) * out_dim_shape;
I think compiler can optimize % to this. Let's keep modulo here for better readability.
The compiler doesn't do that:
- nvcc is not intelligent enough to do this;
- gcc doesn't need to, since this doesn't slow down CPU performance in any measurable way.
I have added a comment to explain that this is a modulo operation.
aux_data->output_shape[iter] = out_shape[iter];
iter--;
for (; iter >= 0; --iter) {
  aux_data->out_stride[iter] = aux_data->out_stride[iter+1] * out_shape[iter+1];
nit: add spaces around operators like +
With the latest change, could you please re-run opperf and paste the latest performance results?
@ptrendx Your review will be appreciated.
@apeforest Actually, there isn't any change in the results after re-running opperf. I kept updating the description whenever the results differed, and currently they are consistent with the ones in the description. I will paste the BERT results in a few minutes.
@eric-haibin-lin @sxjscience @apeforest @szhengac BERT (training run, single GPU on p3.16xl). Configurations measured:
- new no-LT
- new LT
- master no-LT
- master LT
(result tables not preserved in this export)
LGTM
* adding separate int32_t kernel for GPU in broadcast_axis/to/like operators
* using structure instead of temp workspace to pass stride and shape
* replacing hardcoded int32_t with generic index_t
* combining CPU and GPU kernels to leverage cached stride calculation and fast access shape data in both

Co-authored-by: Rohit Kumar Srivastava <srivastava.141@buckeyemail.osu.edu>
* Improving performance of broadcast_axis on GPU (#18168)
  * adding separate int32_t kernel for GPU in broadcast_axis/to/like operators
  * using structure instead of temp workspace to pass stride and shape
  * replacing hardcoded int32_t with generic index_t
  * combining CPU and GPU kernels to leverage cached stride calculation and fast access shape data in both
* Improve performance of broadcast_axis on CPU (#17882)
  * adding comments explaining code optimizations
  * fixing broadcast_axis kernel to int32
  * fixing slice_axis kernel to int32
  * combining CPU and GPU implementation method signatures and cleaned up code
  * adding new broadcast_axis to np_matmul

Co-authored-by: Rohit Kumar Srivastava <srivastava.141@buckeyemail.osu.edu>
Description
Improving GPU kernel performance by caching stride calculations and shape data.
This change is needed to maintain BERT performance when large tensor support is enabled by default.
CPU performance remains identical to or better than master-lt and master, respectively (no changes there).
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Testing
Results
Profiling Code: