This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Improving performance of broadcast_axis on GPU #18168

Merged
merged 4 commits into apache:master from access2rohit:typedef on May 1, 2020

Conversation

access2rohit
Contributor

@access2rohit access2rohit commented Apr 25, 2020

Description

Improves GPU kernel performance for broadcast_axis by caching stride calculations and shape data.

  • Passing mshadow::Shape to the kernel slowed it down by approximately 15%, so shape and stride data are passed in a separate struct instead.
  • #pragma unroll gives a boost of about 4%.
  • The bulk of the improvement comes from caching the stride calculations and combining two redundant multiplications.

This change is made in order to maintain BERT performance when large tensor support is enabled by default. A rough sketch of the caching approach is shown below.

CPU performance remains identical to or better than master-LT and master, respectively (no CPU changes in this PR).
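
The sketch below illustrates the idea (a minimal sketch, not the PR's exact code): compute strides once on the host, pass them to the kernel in a small struct, and reuse the quotient of the division instead of issuing a separate modulo. Names such as ShapeAndStride, MAX_DIM, PrepareShapeAndStride and OutToInIndex are illustrative.

#include <cstdint>

constexpr int MAX_DIM = 5;  // assumed upper bound on ndim for this sketch

// POD struct carrying cached shapes and strides; cheap to pass by value to a kernel.
struct ShapeAndStride {
  int32_t in_stride[MAX_DIM];
  int32_t out_stride[MAX_DIM];
  int32_t input_shape[MAX_DIM];
  int32_t output_shape[MAX_DIM];
};

// Host-side helper: compute strides once instead of recomputing them per element.
inline void PrepareShapeAndStride(ShapeAndStride* aux,
                                  const int32_t* in_shape,
                                  const int32_t* out_shape,
                                  int ndim) {
  aux->in_stride[ndim - 1] = 1;
  aux->out_stride[ndim - 1] = 1;
  for (int i = ndim - 2; i >= 0; --i) {
    aux->in_stride[i] = aux->in_stride[i + 1] * in_shape[i + 1];
    aux->out_stride[i] = aux->out_stride[i + 1] * out_shape[i + 1];
  }
  for (int i = 0; i < ndim; ++i) {
    aux->input_shape[i] = in_shape[i];
    aux->output_shape[i] = out_shape[i];
  }
}

// Per-element mapping used by a broadcast kernel: convert a flat output index
// into the corresponding flat input index, skipping broadcast (size-1) dims.
inline int32_t OutToInIndex(int32_t i, const ShapeAndStride& aux, int ndim) {
  int32_t idx = i;
  int32_t in_idx = 0;
  #pragma unroll 4  // compiler hint for the short dimension loop
  for (int d = ndim - 1; d >= 0; --d) {
    int32_t dim_size = aux.output_shape[d];
    int32_t q = idx / dim_size;
    int32_t dim_idx = idx - q * dim_size;  // equivalent to idx % dim_size
    if (aux.input_shape[d] != 1) {
      in_idx += dim_idx * aux.in_stride[d];
    }
    idx = q;
  }
  return in_idx;
}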

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Testing

ubuntu@ip-172-31-0-156 ~/workspace/incubator-mxnet (typedef) $ MXNET_TEST_COUNT=1 nosetests --logging-level=DEBUG --verbose -s tests/python/gpu/test_operator_gpu.py:test_broadcast

[DEBUG] 1000 of 1000: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1707871653 to reproduce.
ok

----------------------------------------------------------------------
Ran 1 test in 1772.105s

OK

Results

Profiling Code:

import mxnet as mx
import mxnet.ndarray as nd
from benchmark.opperf.utils.benchmark_utils import run_performance_test

#local_ctx = mx.cpu()
local_ctx = mx.gpu()


add_res = run_performance_test(nd.broadcast_axis, run_backward=True, dtype='float32', ctx=local_ctx,
                               inputs=[{'data': (10, 1, 10, 1), 'axis': (1, 3), 'size': (1000, 5000)}],
                               warmup=50, runs=500, profiler='python')
print(add_res)

add_res = run_performance_test(nd.broadcast_axis, run_backward=True, dtype='float32', ctx=local_ctx,
                               inputs=[{'data': (100, 1, 1, 10), 'axis': (1, 2), 'size': (1000, 500)}],
                               warmup=50, runs=500, profiler='python')
print(add_res)

add_res = run_performance_test(nd.broadcast_axis, run_backward=True, dtype='float32', ctx=local_ctx,
                               inputs=[{'data': (1, 1, 1, 1), 'axis': (0, 1, 2, 3), 'size': (10000, 10, 100, 50)}],
                               warmup=50, runs=500, profiler='python')
print(add_res)

add_res = run_performance_test(nd.broadcast_axis, run_backward=True, dtype='float32', ctx=local_ctx,
                               inputs=[{'data': (1, 10, 1, 10, 1), 'axis': (0, 2, 4), 'size': (200, 1000, 50)}],
                               warmup=50, runs=500, profiler='python')
print(add_res)
code version   cases                                           avg      p50      p90
master LT      (10, 1, 10, 1) -> (10, 1000, 10, 5000)          65.75    65.71    65.82
master LT      (100, 1, 1, 10) -> (100, 1000, 500, 10)         64.65    64.57    64.68
master LT      (1, 1, 1, 1) -> (10000, 10, 100, 50)            46.61    46.54    46.64
master LT      (1, 10, 1, 10, 1) -> (200, 10, 1000, 10, 50)    128.52   128.49   128.71
master no-LT   (10, 1, 10, 1) -> (10, 1000, 10, 5000)          28.42    28.37    28.45
master no-LT   (100, 1, 1, 10) -> (100, 1000, 500, 10)         27.38    27.30    27.37
master no-LT   (1, 1, 1, 1) -> (10000, 10, 100, 50)            20.48    20.35    20.41
master no-LT   (1, 10, 1, 10, 1) -> (200, 10, 1000, 10, 50)    53.68    53.57    53.74
new LT         (10, 1, 10, 1) -> (10, 1000, 10, 5000)          27.24    27.10    27.18
new LT         (100, 1, 1, 10) -> (100, 1000, 500, 10)         25.42    25.34    25.40
new LT         (1, 1, 1, 1) -> (10000, 10, 100, 50)            14.99    14.83    14.93
new LT         (1, 10, 1, 10, 1) -> (200, 10, 1000, 10, 50)    51.66    51.80    51.90
new no-LT      (10, 1, 10, 1) -> (10, 1000, 10, 5000)          22.18    22.10    22.18
new no-LT      (100, 1, 1, 10) -> (100, 1000, 500, 10)         20.28    20.27    20.32
new no-LT      (1, 1, 1, 1) -> (10000, 10, 100, 50)            12.73    12.69    12.73
new no-LT      (1, 10, 1, 10, 1) -> (200, 10, 1000, 10, 50)    40.21    40.11    40.28

@mxnet-bot

Hey @access2rohit, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, edge, centos-cpu, unix-cpu, unix-gpu, windows-cpu, miscellaneous, website, sanity, centos-gpu, clang]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@access2rohit
Contributor Author

@mxnet-label-bot add [pr-work-in-progress]

@lanking520 lanking520 added the pr-work-in-progress PR is still work in progress label Apr 25, 2020
@access2rohit access2rohit force-pushed the typedef branch 3 times, most recently from 0c0d9a4 to 4a9d417 Compare April 28, 2020 08:05
const int32_t ndim) {
int32_t idx = i;
int32_t in_idx = i;
#pragma unroll 4
Contributor

Did you benchmark if unroll makes a measurable difference? Background: currently libmxnet.so is quite large, and we may want to remove loop unrolling and force inline to reduce the size

Contributor Author

@access2rohit access2rohit Apr 29, 2020

The loop has only 4 iterations. @ptrendx suggested that when indexing into an array held in a struct, the compiler doesn't know the index at compile time, so unrolling can give better performance (see the illustration below).

@leezu To answer your question: yes, there is a performance improvement when unrolling, and slightly more with unroll 4 than with plain unroll.
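
For context, a minimal illustration (not the PR's code; the names are made up) of the two pragma variants being compared. With nvcc/clang, #pragma unroll lets the compiler pick an unroll factor, while #pragma unroll 4 requests a factor of 4, matching the at-most-4 iterations of the dimension loop here.

#include <cstdint>

// Hedged sketch: sums per-dimension strides, standing in for the real
// per-element index computation. Only the pragma differs between the two.
inline int32_t SumStridesAutoUnroll(const int32_t* stride, int ndim) {
  int32_t acc = 0;
  #pragma unroll              // compiler chooses the unroll factor
  for (int d = 0; d < ndim; ++d) acc += stride[d];
  return acc;
}

inline int32_t SumStridesUnrollBy4(const int32_t* stride, int ndim) {
  int32_t acc = 0;
  #pragma unroll 4            // explicitly request an unroll factor of 4
  for (int d = 0; d < ndim; ++d) acc += stride[d];
  return acc;
}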

Contributor

@access2rohit Could you provide data: how much performance improvement do you see with unroll vs. unroll 4?

Also, is the index_t in this kernel still int64_t?

Contributor Author

@access2rohit access2rohit Apr 29, 2020

@apeforest
The performance gain is about 4% with unroll.
I wanted to keep the implementation consistent with the existing code, so I kept index_t. Since the target was to make sure that LT performance with the new approach is >= master no-LT performance, that target is still met.

I could also hard-code every variable inside the GPU kernel to int32_t, which would yield a further 10-15% gain for large-tensor MXNet.

Contributor

It is confusing to mix index_t with int in this kernel, since your variable i in the Map function is an int.

Contributor Author

aah ... my bad will fix it

Contributor

@apeforest apeforest left a comment

Based on your PR description, the only difference between the CPU and GPU kernels is that the GPU kernel uses int32_t for indexing while the CPU kernel uses int64_t (when the large-tensor compile flag is on). Duplicating the kernel implementation into two structs seems like overkill to me. Could we optimize the code so that they still share the same kernel template, just with a different index_t type? It would also be nice to re-architect the code so that this applies to all operators that share the same template between CPU and GPU kernels.

const int32_t ndim) {
int32_t idx = i;
int32_t in_idx = i;
#pragma unroll 4
Contributor

@access2rohit Could you provide data: how much performance improvement do you see with unroll vs. unroll 4?

Also, is the index_t in this kernel still int64_t?

@access2rohit
Contributor Author

access2rohit commented Apr 29, 2020

> Based on your PR description, the only difference between the CPU and GPU kernels is that the GPU kernel uses int32_t for indexing while the CPU kernel uses int64_t (when the large-tensor compile flag is on). Duplicating the kernel implementation into two structs seems like overkill to me. Could we optimize the code so that they still share the same kernel template, just with a different index_t type? It would also be nice to re-architect the code so that this applies to all operators that share the same template between CPU and GPU kernels.

@apeforest

  • We cannot use a typedef based on xpu because a typedef is a compile-time construct.
  • The blast radius is large if we make a generic CPU/GPU kernel. NumPy ops that call the broadcast_axis kernel internally would need to be updated as well, so I created broadcast_axis_gpu for ops that rely only on BroadcastComputeImpl internally; this is evident from the 43 GPU tests that failed when a temp workspace was used.

Hence, the two structs are not overkill but necessary to limit the impact; otherwise the change would require performing stride calculations inside the implementations of roughly 15 operators, significantly increasing the scope of the change.

Again, a typedef is resolved at compile time and cannot be decided dynamically.

I have updated the description; it wasn't consistent while the PR was WIP. It is now ready for review!

@access2rohit
Contributor Author

access2rohit commented Apr 29, 2020

@mxnet-bot run ci [macosx-x86_64, edge, miscellaneous, clang]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu, clang, unix-gpu, miscellaneous, edge]

@mxnet-bot

Jenkins CI successfully triggered : [edge, clang, miscellaneous]

@access2rohit access2rohit changed the title [WIP]separate GPU kernel for broadcast_axis Separate GPU kernel for broadcast_axis Apr 29, 2020
@access2rohit
Contributor Author

@mxnet-label-bot update [pr-awaiting-review]

@lanking520 lanking520 added pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress labels Apr 29, 2020
@access2rohit
Contributor Author

@ptrendx can you review this PR?

@apeforest
Contributor

> We cannot use a typedef based on xpu because a typedef is a compile-time construct. [...] Hence, the two structs are not overkill but necessary to limit the impact; otherwise the change would require performing stride calculations inside the implementations of roughly 15 operators. Again, a typedef is resolved at compile time and cannot be decided dynamically.

I did not mean that typedef is the only way to do this. Can we leverage templates? In fact, the Map function in the GPU kernel always uses int instead of index_t.

@access2rohit
Contributor Author

access2rohit commented Apr 29, 2020

> I did not mean that typedef is the only way to do this. Can we leverage templates? In fact, the Map function in the GPU kernel always uses int instead of index_t.

@apeforest: the kernels can be fused using templates, but that approach hinders CPU performance. There is another PR that actually boosts CPU performance, #17882, where I leverage vectorization.

for (index_t iter = ndim - 1; iter >= 0; --iter) {
index_t out_dim_shape = aux_data.output_shape[iter];
index_t out_dim_stride = aux_data.out_stride[iter];
index_t dim_idx = idx - (idx / out_dim_shape) * out_dim_shape;
Contributor

Does this really improve performance compared to using %?

Contributor Author

@access2rohit access2rohit Apr 29, 2020

Very little, not more than 2%. But if you look at nvprof's "ALU instructions issued" metric, the instruction count is reduced.

Kernel<broadcast_kernel<mshadow_op::identity>, xpu>::Launch(
s, out.shape_.Size(), data.dptr_, out.dptr_, in_shape, out_shape, req[0], 2);
if (ctx.run_ctx.get_ctx().dev_type == Context::kGPU) {
Kernel<broadcast_kernel_gpu<mshadow_op::identity>, xpu>::Launch(
Contributor

Maybe I missed something, but the GPU and CPU kernels look highly similar. Adding a branch here makes the code very fragmented, since this entire function template<typename xpu> inline void BroadcastComputeImpl is supposed to be a generic template for both CPU and GPU.

I think we need a better way to code this. Two approaches I can think of:
(1) consolidate the CPU and GPU kernels;
(2) instead of a generic template, break it into two different functions.

(1) is preferred in my opinion (see the sketch below).
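
As an illustration of option (1), a minimal sketch assuming the kernel is parameterized on the index type (IndexT would be int32_t on GPU and index_t, i.e. int64_t with large tensors, on CPU). ShapeAndStrideT and SharedBroadcastKernel are illustrative names, not the PR's actual identifiers.

#include <cstdint>

template <typename IndexT, int MaxDim = 5>
struct ShapeAndStrideT {
  IndexT in_stride[MaxDim];
  IndexT input_shape[MaxDim];
  IndexT output_shape[MaxDim];
};

// One kernel body shared by CPU and GPU; only the index type differs.
template <typename DType, typename IndexT>
struct SharedBroadcastKernel {
  static inline void Map(IndexT i, const DType* in, DType* out,
                         const ShapeAndStrideT<IndexT>& aux, int ndim) {
    IndexT idx = i;
    IndexT in_idx = 0;
    for (int d = ndim - 1; d >= 0; --d) {
      IndexT dim_size = aux.output_shape[d];
      IndexT dim_idx = idx % dim_size;
      if (aux.input_shape[d] != 1) {   // size-1 input dims broadcast
        in_idx += dim_idx * aux.in_stride[d];
      }
      idx /= dim_size;
    }
    out[i] = in[in_idx];
  }
};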

@@ -128,7 +128,7 @@ NNVM_REGISTER_OP(_broadcast_backward)
.set_attr<FResourceRequest>("FResourceRequest",
[](const NodeAttrs& attrs) {
return std::vector<ResourceRequest>{ResourceRequest::kTempSpace};
});
});
Contributor

Why remove the indent here?

Contributor Author

@access2rohit access2rohit Apr 29, 2020

The indent wasn't supposed to be there; it should align with NNVM. For example: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/matrix_op.cc#L446-L491

Contributor

Why must it align with NNVM? There is no open brace for NNVM; this closing brace matches the set_attr function.
Please see https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/matrix_op.cc#L751

Contributor Author

Saw other examples in the codebase. Makes sense!

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu, sanity, clang, edge, website, centos-gpu, centos-cpu, miscellaneous, windows-gpu, unix-cpu, windows-cpu]

@access2rohit access2rohit changed the title Separate GPU kernel for broadcast_axis Improving performance of broadcast_axis on GPU Apr 30, 2020
Contributor

@apeforest apeforest left a comment

Looks much nicer. Only a few comments about styling. Otherwise good to go. Thanks for your effort.

@@ -1049,29 +1049,55 @@ void ReduceAxesBackwardUseInOut(const nnvm::NodeAttrs& attrs,
ReduceAxesBackwardUseInOutImpl<xpu, OP, normalize>(ctx, small, inputs, req, outputs);
}

namespace { // unnamed namespace to keep scope of the struct within the file
struct shape_and_stride {
Contributor

Please use CamelCase naming convention for struct. See this: https://google.github.io/styleguide/cppguide.html#Type_Names

Contributor Author

Will do. But it's confusing that in our code a struct used as a kernel uses snake_case, while a plain struct is CamelCase.

Member

I agree with @apeforest in terms of the CamelCase standard.
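
For clarity, a small illustration of the naming convention being discussed (bodies elided; names illustrative):

// Kernel functors in this file keep the existing snake_case pattern:
struct broadcast_kernel_gpu { /* Map(...) */ };

// A plain data-carrying struct follows the style guide's CamelCase:
struct ShapeAndStride { /* cached shapes and strides */ };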

if (in_shape[iter] != 1) {
in_idx += dim_idx * in_stride;
#pragma unroll 4
for (index_t iter = ndim - 1; iter >= 0; --iter) {
Contributor

iter is int type

for (index_t iter = ndim - 1; iter >= 0; --iter) {
index_t out_dim_shape = aux_data.output_shape[iter];
index_t out_dim_stride = aux_data.out_stride[iter];
index_t dim_idx = idx - (idx / out_dim_shape) * out_dim_shape;
Contributor

I think the compiler can optimize % into this. Let's keep the modulo here for better readability.

Contributor Author

@access2rohit access2rohit May 1, 2020

The compiler doesn't do that:
nvcc is not smart enough to apply this transformation, and gcc doesn't need to, since the modulo doesn't slow down CPU performance in any measurable way.

I have added a comment explaining that this is a modulo operation.
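
For reference, a minimal illustration of the transformation being discussed (identifiers here are generic, not the PR's):

#include <cstdint>
#include <cassert>

// idx % dim_size and the subtract-of-a-multiply form compute the same value;
// writing it by hand lets the quotient be reused for the next (outer) dimension
// instead of issuing a separate modulo.
inline void SplitIndex(int32_t idx, int32_t dim_size,
                       int32_t* dim_idx, int32_t* next_idx) {
  int32_t q = idx / dim_size;        // quotient, reused below
  *dim_idx = idx - q * dim_size;     // equivalent to idx % dim_size
  *next_idx = q;                     // flat index for the remaining dimensions
}

int main() {
  int32_t dim_idx = 0, next_idx = 0;
  SplitIndex(12345, 100, &dim_idx, &next_idx);
  assert(dim_idx == 12345 % 100 && next_idx == 12345 / 100);
  return 0;
}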

aux_data->output_shape[iter] = out_shape[iter];
iter--;
for (; iter >= 0; --iter) {
aux_data->out_stride[iter] = aux_data->out_stride[iter+1] * out_shape[iter+1];
Contributor

nit: add spaces around operators like +

@apeforest
Contributor

With the latest change, could you please re-run the opperf and paste the latest performance results?

@apeforest
Contributor

@ptrendx Your review will be appreciated.

@access2rohit
Contributor Author

> With the latest change, could you please re-run the opperf and paste the latest performance results?

@apeforest Actually, there isn't any change in the results after re-running opperf. I kept updating the description whenever the results differed, and they are currently consistent with the ones in the description. I will paste the BERT results in a few minutes.

@access2rohit
Contributor Author

access2rohit commented May 1, 2020

@eric-haibin-lin @sxjscience @apeforest @szhengac

BERT (Training Run Single GPU on p3.16xl)
Command:

python3 run_pretraining.py --data='./part-0000.train' --data_eval='./part-0000.train' --num_steps 1000 --lr 1e-4 --optimizer lamb --accumulate 1 --raw --gpus 0 --num_dataset_workers 2 --num_batch_workers 1 --circle_length 1 --total_batch_size 8 --total_batch_size_eval 8 --log_interval 10

new no-LT

INFO:root:[step 950]    mlm_loss=8.22240    mlm_acc=7.16    nsp_loss= 0.72  nsp_acc=48.75   throughput=26.4K tks/s  lr=0.0000051 time=1.06, latency=105.9 ms/step
INFO:root:[step 960]    mlm_loss=8.21638    mlm_acc=7.63    nsp_loss= 0.66  nsp_acc=56.25   throughput=25.3K tks/s  lr=0.0000040 time=1.06, latency=105.8 ms/step
INFO:root:[step 970]    mlm_loss=8.15530    mlm_acc=8.08    nsp_loss= 0.64  nsp_acc=53.75   throughput=26.5K tks/s  lr=0.0000030 time=1.07, latency=106.6 ms/step
INFO:root:[step 980]    mlm_loss=8.19152    mlm_acc=7.63    nsp_loss= 0.66  nsp_acc=68.75   throughput=21.2K tks/s  lr=0.0000020 time=1.06, latency=106.5 ms/step
INFO:root:[step 990]    mlm_loss=8.20629    mlm_acc=7.22    nsp_loss= 0.63  nsp_acc=65.00   throughput=27.7K tks/s  lr=0.0000010 time=1.08, latency=108.0 ms/step
INFO:root:[step 1000]   mlm_loss=8.22327    mlm_acc=7.58    nsp_loss= 0.62  nsp_acc=62.50   throughput=26.4K tks/s  lr=0.0000000 time=1.07, latency=107.3 ms/step
INFO:root:[step 1000] Saving trainer states to ./ckpt_dir/0001000.states.00.
INFO:root:[step 1000] Saving model params to ./ckpt_dir/0001000.params.
INFO:root:Train cost=131.5s

new LT

INFO:root:[step 950]    mlm_loss=8.21955    mlm_acc=7.49    nsp_loss= 0.68  nsp_acc=56.25   throughput=26.2K tks/s  lr=0.0000051 time=1.06, latency=106.3 ms/step
INFO:root:[step 960]    mlm_loss=8.19205    mlm_acc=7.28    nsp_loss= 0.68  nsp_acc=61.25   throughput=28.1K tks/s  lr=0.0000040 time=1.07, latency=106.9 ms/step
INFO:root:[step 970]    mlm_loss=8.13653    mlm_acc=8.54    nsp_loss= 0.67  nsp_acc=61.25   throughput=24.6K tks/s  lr=0.0000030 time=1.06, latency=106.0 ms/step
INFO:root:[step 980]    mlm_loss=8.12760    mlm_acc=8.43    nsp_loss= 0.64  nsp_acc=67.50   throughput=26.1K tks/s  lr=0.0000020 time=1.07, latency=106.9 ms/step
INFO:root:[step 990]    mlm_loss=8.10771    mlm_acc=8.24    nsp_loss= 0.68  nsp_acc=45.00   throughput=25.9K tks/s  lr=0.0000010 time=1.07, latency=107.0 ms/step
INFO:root:[step 1000]   mlm_loss=8.16389    mlm_acc=7.53    nsp_loss= 0.63  nsp_acc=58.75   throughput=28.0K tks/s  lr=0.0000000 time=1.07, latency=106.9 ms/step
INFO:root:[step 1000] Saving trainer states to ./ckpt_dir/0001000.states.00.
INFO:root:[step 1000] Saving model params to ./ckpt_dir/0001000.params.
INFO:root:Train cost=127.3s

master no-LT

INFO:root:[step 950]    mlm_loss=8.27928    mlm_acc=7.14    nsp_loss= 0.67  nsp_acc=53.75   throughput=25.5K tks/s  lr=0.0000051 time=1.06, latency=105.9 ms/step
INFO:root:[step 960]    mlm_loss=8.21552    mlm_acc=6.83    nsp_loss= 0.63  nsp_acc=57.50   throughput=25.0K tks/s  lr=0.0000040 time=1.06, latency=106.1 ms/step
INFO:root:[step 970]    mlm_loss=8.17274    mlm_acc=7.57    nsp_loss= 0.65  nsp_acc=63.75   throughput=24.0K tks/s  lr=0.0000030 time=1.07, latency=106.5 ms/step
INFO:root:[step 980]    mlm_loss=8.20168    mlm_acc=7.15    nsp_loss= 0.66  nsp_acc=52.50   throughput=29.1K tks/s  lr=0.0000020 time=1.07, latency=106.5 ms/step
INFO:root:[step 990]    mlm_loss=8.21061    mlm_acc=7.20    nsp_loss= 0.66  nsp_acc=57.50   throughput=25.9K tks/s  lr=0.0000010 time=1.06, latency=106.1 ms/step
INFO:root:[step 1000]   mlm_loss=8.14879    mlm_acc=7.44    nsp_loss= 0.66  nsp_acc=57.50   throughput=25.1K tks/s  lr=0.0000000 time=1.07, latency=107.0 ms/step
INFO:root:[step 1000] Saving trainer states to ./ckpt_dir/0001000.states.00.
INFO:root:[step 1000] Saving model params to ./ckpt_dir/0001000.params.
INFO:root:Train cost=131.9s

master-LT

INFO:root:[step 950]    mlm_loss=8.15209    mlm_acc=7.47    nsp_loss= 0.63  nsp_acc=55.00   throughput=26.0K tks/s  lr=0.0000051 time=1.08, latency=108.3 ms/step
INFO:root:[step 960]    mlm_loss=8.25082    mlm_acc=7.79    nsp_loss= 0.67  nsp_acc=63.75   throughput=22.6K tks/s  lr=0.0000040 time=1.09, latency=108.6 ms/step
INFO:root:[step 970]    mlm_loss=8.20299    mlm_acc=7.08    nsp_loss= 0.68  nsp_acc=62.50   throughput=26.6K tks/s  lr=0.0000030 time=1.09, latency=108.7 ms/step
INFO:root:[step 980]    mlm_loss=8.10018    mlm_acc=8.48    nsp_loss= 0.70  nsp_acc=52.50   throughput=26.2K tks/s  lr=0.0000020 time=1.09, latency=108.7 ms/step
INFO:root:[step 990]    mlm_loss=8.27777    mlm_acc=7.00    nsp_loss= 0.68  nsp_acc=51.25   throughput=22.3K tks/s  lr=0.0000010 time=1.08, latency=108.1 ms/step
INFO:root:[step 1000]   mlm_loss=8.22438    mlm_acc=7.22    nsp_loss= 0.63  nsp_acc=58.75   throughput=23.8K tks/s  lr=0.0000000 time=1.08, latency=108.3 ms/step
INFO:root:[step 1000] Saving trainer states to ./ckpt_dir/0001000.states.00.
INFO:root:[step 1000] Saving model params to ./ckpt_dir/0001000.params.
INFO:root:Train cost=134.8s

Contributor

@apeforest apeforest left a comment

LGTM

@apeforest apeforest merged commit 5950d8c into apache:master May 1, 2020
AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this pull request Jul 6, 2020

* adding separate int32_t kernel for GPU in broadcast_axis/to/like operators

* using structure instead of temp workspace to pass stride and shape

* replacing hardcoded int32_t with generic index_t

* combining CPU and GPU kernels to leverage cached stride calculation and fast access shape data in both

Co-authored-by: Rohit Kumar Srivastava <srivastava.141@buckeyemail.osu.edu>

access2rohit added commits to access2rohit/incubator-mxnet that referenced this pull request on Jul 23 (twice), Jul 24, Jul 27, and Jul 28, 2020, all with the same commit message as above.
szha pushed a commit that referenced this pull request Jul 28, 2020
* Improving performance of broadcast_axis on GPU (#18168)

* adding separate int32_t kernel for GPU in broadcast_axis/to/like operators

* using structure instead of temp workspace to pass stride and shape

* replacing hardcoded int32_t with generic index_t

* combining CPU and GPU kernels to leverage cached stride calculation and fast access shape data in both

Co-authored-by: Rohit Kumar Srivastava <srivastava.141@buckeyemail.osu.edu>

* Improve performance of broadcast_axis on CPU (#17882)

* adding comments explaining code optimizations

* fixing broadcast_axis kernel to int32

* fixing slice_axis kernel to int32

* combining CPU and GPU implementation method signatures and cleaned up
code

* adding new broadcast_axis to np_matmul

Co-authored-by: Rohit Kumar Srivastava <srivastava.141@buckeyemail.osu.edu>

Co-authored-by: Rohit Kumar Srivastava <srivastava.141@buckeyemail.osu.edu>
Labels
pr-awaiting-review PR is waiting for code review
7 participants