
MKLDNN RNN Inference Integration(fp32 LSTM and vRNN with tanh and relu) #14713

Merged
merged 33 commits into apache:master from mkldnn_lstm_infer_fp32 on May 24, 2019

Conversation

@lihaofd (Contributor) commented Apr 17, 2019

Description

This PR integrates MKLDNN RNN inference (fp32 LSTM and vRNN with tanh and relu).
@pengzhao-intel, @TaoLv , @ciyongch

Feature changes

New features

  • Single-layer/multi-layer and unidirectional/bidirectional inference via MKLDNN LSTM and vRNN (tanh and relu)

Unit-test changes

  • Using existing test cases in test_operator.py to check consistency with the original RNN Cell implementation.

Performance

We have tested the performance of FusedRNN (USE_MKLDNN = 0 and 1) on a local Skylake-8180 with 1 socket and 28 cores, using MKL as the BLAS library in this performance test.
The test input size comes from the DS2 default parameters (seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800), with MKLDNN commit 57e1203092f63941475ec4088ccd3cf609ed9d7a.
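
As a point of reference, throughput with these shapes can be approximated with a script along the lines below. This is a minimal sketch using the Gluon LSTM layer (which typically dispatches to the fused RNN operator for CPU inference), not the authors' actual benchmark; whether the MKLDNN path is taken depends on the USE_MKLDNN build flag.

```python
# Illustrative benchmark sketch (not the authors' actual script).
import time
import mxnet as mx

seq_length, batch_size, input_size, hidden_size = 300, 20, 800, 800

layer = mx.gluon.rnn.LSTM(hidden_size, num_layers=1, bidirectional=False)
layer.initialize()
x = mx.nd.random.uniform(shape=(seq_length, batch_size, input_size))

layer(x).wait_to_read()  # warm-up: excludes first-call weight reorder/caching

runs = 20
start = time.time()
for _ in range(runs):
    layer(x).wait_to_read()
print("samples/sec: %.1f" % (runs * batch_size / (time.time() - start)))
```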

Layer = 1, bidirectional = False

| API                 | MKLDNN = 0 (samples/sec) | MKLDNN = 1 (samples/sec) | Speedup |
| ------------------- | ------------------------ | ------------------------ | ------- |
| FusedLSTM           | 255                      | 637                      | 2.5x    |
| FusedvRNN with tanh | 989                      | 1449                     | 1.47x   |
| FusedvRNN with relu | 1296                     | 1442                     | 1.11x   |

Layer = 5, bidirectional = True

| API                 | MKLDNN = 0 (samples/sec) | MKLDNN = 1 (samples/sec) | Speedup |
| ------------------- | ------------------------ | ------------------------ | ------- |
| FusedLSTM           | 26                       | 56                       | 2.15x   |
| FusedvRNN with tanh | 83                       | 157                      | 1.9x    |
| FusedvRNN with relu | 104                      | 152                      | 1.46x   |

Checklist

  • Passed code style checking (make lint).
  • All changes have test coverage.
  • Code is well-documented.

@lihaofd changed the title MKLDNN RNN Inference Integration(fp32 lstm and vRNN with tanh and relu) → MKLDNN RNN Inference Integration(fp32 LSTMand vRNN with tanh and relu) Apr 17, 2019
@lihaofd changed the title MKLDNN RNN Inference Integration(fp32 LSTMand vRNN with tanh and relu) → MKLDNN RNN Inference Integration(fp32 LSTM and vRNN with tanh and relu) Apr 17, 2019
@pengzhao-intel (Contributor)

@lihaofd We need to upgrade MKL-DNN in a separate PR, but you can use this PR for the CI testing.

@Roshrini added the pr-awaiting-review (PR is waiting for code review) label on Apr 17, 2019
@pengzhao-intel (Contributor)

FYI @anirudh2290 @szha we are starting the MKLDNN RNN integration :)
The order of PRs is: inference -> training -> INT8 -> ....

@anirudh2290 (Member)

Great to hear! Looking forward to this

@lihaofd reopened this Apr 19, 2019
@lihaofd force-pushed the mkldnn_lstm_infer_fp32 branch 2 times, most recently from 3769be5 to 5324c93 on April 19, 2019 05:40
```c++
auto concat_pd = concat::primitive_desc(dst_desc, concat_dimension, srcs_pd);
MKLDNNStream::Get()->RegisterPrim(concat(concat_pd, inputs, dst));
MKLDNNStream::Get()->Submit();
}
```
Member:

Can we leverage the concat implementation in mkldnn_concat.cc? Do you think the concat primitive here needs to be cached?

Contributor Author:

There are too many data segments with different sizes, dims, counts, concat_dims, etc. Caching them would make the MKLDNN cache much more complicated without much benefit to performance.

```c++
mkldnn::memory::dims dst_iter_tz = {1, 2, nstates, N, H};  // ldsnc

std::vector<float> weights_scales(ngates * H);
if (!cached) {
```
Member:

What's cached? How is it cached?

Contributor Author:

On the first call, it performs data preparation, such as concatenating/reordering wx and wh for the multi-layer, unidirectional or bidirectional cases, and saves the results into MKLDNN cached memory. On subsequent calls, these data are used directly.
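
In other words, the flag guards a one-time weight preparation. Below is a minimal sketch of the idea in Python with hypothetical names (the PR's real code does this in C++ with MKLDNN memory):

```python
import numpy as np

class CachedRnnWeights:
    """Illustrative only: prepare weights once, reuse on later calls."""
    def __init__(self):
        self.initialized = False
        self.concat_wx_wh = None  # stands in for the cached MKLDNN memory

    def forward(self, wx, wh):
        if not self.initialized:
            # One-time preparation: concat wx and wh (the real code also
            # reorders them into the layout the RNN primitive expects).
            self.concat_wx_wh = np.concatenate([wx, wh], axis=0)
            self.initialized = True
        # From the second call on, the cached weights are used directly.
        return self.concat_wx_wh
```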

@pengzhao-intel (Contributor)

@TaoLv please take a review.
@DickJC123 could you help to take a review if our change is fine for GPU?

@pengzhao-intel moved this from In progress to Review in progress in CPU Performance and Quantization May 22, 2019
@pengzhao-intel (Contributor)

The PR is almost done and we're waiting for the local test.
@TaoLv will update the results soon.

@pengzhao-intel changed the title [WIP]MKLDNN RNN Inference Integration(fp32 LSTM and vRNN with tanh and relu) → MKLDNN RNN Inference Integration(fp32 LSTM and vRNN with tanh and relu) May 23, 2019
```c++
hy_ptr,
cy_ptr,
param_.mode);
#if MXNET_USE_MKLDNN == 1 && !defined(__CUDACC__)
```
Contributor:

Is this __CUDACC__ check forgotten, or is it intended to be left here?

Member:

Yes, I think it can be removed also. @zixuanweeei

Contributor:

It has been modified. Thanks.

CPU Performance and Quantization automation moved this from Review in progress to Reviewer approved May 23, 2019
@TaoLv (Member) left a comment:

The integration looks good to me in general. We can revisit the GRU integration and the training part in follow-up PRs.


```c++
cy_ptr,
param_.mode);
#if MXNET_USE_MKLDNN == 1 && !defined(__CUDACC__)
if (dmlc::GetEnv("MXNET_USE_MKLDNN_RNN", 1) && param_.mode != rnn_enum::kGru) {
```
@TaoLv (Member) commented May 23, 2019:

@szha Please review. We add a new environment variable here. Once it's set to 0, the RNN operator will fall back to the original implementation on CPU. Otherwise, the MKL-DNN RNN primitive will be invoked.
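
For instance, a user could opt out of the MKL-DNN path like this (a usage sketch; since the variable is read via dmlc::GetEnv when the operator runs, it just needs to be set before that point):

```python
# Hypothetical usage sketch: disable the new MKL-DNN RNN path so the RNN
# operator falls back to the original CPU implementation. Set the variable
# before MXNet executes the operator (here, before importing mxnet).
import os
os.environ["MXNET_USE_MKLDNN_RNN"] = "0"

import mxnet as mx  # RNN ops now take the original CPU path in this process
```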

@pengzhao-intel (Contributor) left a comment:

In general, it's good even if it's not perfect yet.

Our team will continuously work to improve RNN interface :)

@zixuanweeei (Contributor)

We also tested the performance of our PR against master on GPU. Here is the result. The relative DIFF is calculated as (Our_PR - MASTER) / MASTER. In summary, our modifications do not significantly degrade GPU performance.

Layer = 1, bidirectional = False

| API                 | Our PR (samples/sec) | MASTER (samples/sec) | DIFF (samples/sec) | Relative DIFF |
| ------------------- | -------------------- | -------------------- | ------------------ | ------------- |
| FusedLSTM           | 1038                 | 1058                 | -20                | -1.89%        |
| FusedvRNN with tanh | 1961                 | 1884                 | 77                 | 4.09%         |
| FusedvRNN with relu | 1926                 | 1939                 | -13                | -0.67%        |

Layer = 1, bidirectional = True

| API                 | Our PR (samples/sec) | MASTER (samples/sec) | DIFF (samples/sec) | Relative DIFF |
| ------------------- | -------------------- | -------------------- | ------------------ | ------------- |
| FusedLSTM           | 683                  | 694                  | -11                | -1.59%        |
| FusedvRNN with tanh | 1221                 | 1190                 | 31                 | 2.61%         |
| FusedvRNN with relu | 1212                 | 1209                 | 3                  | 0.25%         |

Layer = 5, bidirectional = False

| API                 | Our PR (samples/sec) | MASTER (samples/sec) | DIFF (samples/sec) | Relative DIFF |
| ------------------- | -------------------- | -------------------- | ------------------ | ------------- |
| FusedLSTM           | 322                  | 320                  | 2                  | 0.63%         |
| FusedvRNN with tanh | 676                  | 649                  | 27                 | 4.16%         |
| FusedvRNN with relu | 670                  | 637                  | 33                 | 5.18%         |

Layer = 5, bidirectional = True

| API                 | Our PR (samples/sec) | MASTER (samples/sec) | DIFF (samples/sec) | Relative DIFF |
| ------------------- | -------------------- | -------------------- | ------------------ | ------------- |
| FusedLSTM           | 110                  | 109                  | 1                  | 0.92%         |
| FusedvRNN with tanh | 210                  | 209                  | 1                  | 0.48%         |
| FusedvRNN with relu | 212                  | 210                  | 2                  | 0.95%         |

@DickJC123 (Contributor)

I think it prudent to resolve the GPU platform issues with rnn-inl.h introduced by commit 1c49e40 before finally accepting this PR (see #15034). Besides introducing test failures of test_rnntanh_bidirectional on P40 GPUs, I have noticed that the codebase no longer compiles against cuDNN versions < 7.0. I intend to submit a PR to resolve both these issues within 24 hours, probably less.

@pengzhao-intel (Contributor)

The PR is ready to be merged.

@szha @DickJC123 do we need to wait for #15056?

@pengzhao-intel (Contributor)

Merging this one first since #15056 is still WIP.

@pengzhao-intel merged commit 653cbb4 into apache:master May 24, 2019
CPU Performance and Quantization automation moved this from Reviewer approved to Done May 24, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…u) (apache#14713)

* trigger the ci

* integrate mkldnn rnn fp32 inference(LSTM and vRNN with tanh and relu)

* fix bug about comparison between signed and unsigned integer expressions

* fix unix-gpu issue

* fix unix gpu bug

* fix unix-gpu issues

* fix some comments

* fix issue

* fix comment

* rename `cached` to `initialized`

* support IType

* TODO for MKLDNN GRU

* fix bugs in memory adjustment

* Reformat TODO for MKLDNN GRU

* Reserve original RNN path

* Remove MKLDNN GRU

* Fix bug for rnn forward

* Remove `__CUDAACC__`

* Move `RNNStatefulComputeCPU` to rnn.cc

* Remove redundent macro of `__CUDACC__`

* Remove the last macro `__CUDACC__` from rnn*