
MKLDNN RNN Inference Integration(fp32 LSTM and vRNN with tanh and relu) #14713

Merged
merged 33 commits into apache:master from mkldnn_lstm_infer_fp32 on May 24, 2019

Conversation

@lihaofd (Contributor) commented Apr 17, 2019

Description

This PR integrates MKLDNN RNN inference (fp32 LSTM and vRNN with tanh and relu).
@pengzhao-intel, @TaoLv , @ciyongch

Feature changes

New features

  • Single-layer/multi-layer and unidirectional/bidirectional inference via MKLDNN LSTM and vRNN (tanh and relu)

Unit-test changes

  • Using existing test cases in test_operator.py to check consistency with the original RNN Cell implementation.

Performance

We have tested the performance of FusedRNN (USE_MKLDNN = 0 and 1) on a local Skylake-8180 with 1 socket and 28 cores, using MKL as the BLAS library in this performance test.
The test input size comes from the DS2 default parameters (seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800), with MKLDNN commit 57e1203092f63941475ec4088ccd3cf609ed9d7a.
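
As a point of reference, throughput with these shapes can be approximated with a script along the lines below. This is a minimal sketch using the Gluon LSTM layer (which typically dispatches to the fused RNN operator for CPU inference), not the authors' actual benchmark; whether the MKLDNN path is taken depends on the USE_MKLDNN build flag.

```python
# Illustrative benchmark sketch (not the authors' actual script).
import time
import mxnet as mx

seq_length, batch_size, input_size, hidden_size = 300, 20, 800, 800

layer = mx.gluon.rnn.LSTM(hidden_size, num_layers=1, bidirectional=False)
layer.initialize()
x = mx.nd.random.uniform(shape=(seq_length, batch_size, input_size))

layer(x).wait_to_read()  # warm-up: excludes first-call weight reorder/caching

runs = 20
start = time.time()
for _ in range(runs):
    layer(x).wait_to_read()
print("samples/sec: %.1f" % (runs * batch_size / (time.time() - start)))
```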

Layer = 1, bidirectional = False

| API                 | MKLDNN = 0 (samples/sec) | MKLDNN = 1 (samples/sec) | Speedup |
| ------------------- | ------------------------ | ------------------------ | ------- |
| FusedLSTM           | 255                      | 637                      | 2.5x    |
| FusedvRNN with tanh | 989                      | 1449                     | 1.47x   |
| FusedvRNN with relu | 1296                     | 1442                     | 1.11x   |

Layer = 5, bidirectional = True

| API                 | MKLDNN = 0 (samples/sec) | MKLDNN = 1 (samples/sec) | Speedup |
| ------------------- | ------------------------ | ------------------------ | ------- |
| FusedLSTM           | 26                       | 56                       | 2.15x   |
| FusedvRNN with tanh | 83                       | 157                      | 1.9x    |
| FusedvRNN with relu | 104                      | 152                      | 1.46x   |

Checklist

  • Passed code style checking (make lint).
  • All changes have test coverage.
  • Code is well-documented.

@lihaofd changed the title MKLDNN RNN Inference Integration(fp32 lstm and vRNN with tanh and relu) → MKLDNN RNN Inference Integration(fp32 LSTMand vRNN with tanh and relu) Apr 17, 2019
@lihaofd changed the title MKLDNN RNN Inference Integration(fp32 LSTMand vRNN with tanh and relu) → MKLDNN RNN Inference Integration(fp32 LSTM and vRNN with tanh and relu) Apr 17, 2019
@pengzhao-intel (Contributor)

@lihaofd We need to upgrade MKL-DNN in a separate PR, but you can use this PR for the CI testing.

@Roshrini added the pr-awaiting-review (PR is waiting for code review) label on Apr 17, 2019
@pengzhao-intel (Contributor)

FYI @anirudh2290 @szha we are starting the MKLDNN RNN integration :)
The order of PRs is: inference -> training -> INT8 -> ....

@anirudh2290 (Member)

Great to hear! Looking forward to this

@lihaofd reopened this Apr 19, 2019
@lihaofd force-pushed the mkldnn_lstm_infer_fp32 branch 2 times, most recently from 3769be5 to 5324c93 on April 19, 2019 05:40
```c++
auto concat_pd = concat::primitive_desc(dst_desc, concat_dimension, srcs_pd);
MKLDNNStream::Get()->RegisterPrim(concat(concat_pd, inputs, dst));
MKLDNNStream::Get()->Submit();
}
```
Member:

Can we leverage the concat implementation in mkldnn_concat.cc? Do you think the concat primitive here needs to be cached?

Contributor Author:

There are too many data segments with different sizes, dims, counts, concat_dims, etc. Caching them would make the MKLDNN cache much more complicated without much benefit to performance.

```c++
mkldnn::memory::dims dst_iter_tz = {1, 2, nstates, N, H};  // ldsnc

std::vector<float> weights_scales(ngates * H);
if (!cached) {
```
Member:

What's cached? How is it cached?

Contributor Author:

On the first call, it performs data preparation, such as concatenating/reordering wx and wh for the multi-layer, unidirectional or bidirectional cases, and saves the results into MKLDNN cached memory. On subsequent calls, these data are used directly.
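
In other words, the flag guards a one-time weight preparation. Below is a minimal sketch of the idea in Python with hypothetical names (the PR's real code does this in C++ with MKLDNN memory):

```python
import numpy as np

class CachedRnnWeights:
    """Illustrative only: prepare weights once, reuse on later calls."""
    def __init__(self):
        self.initialized = False
        self.concat_wx_wh = None  # stands in for the cached MKLDNN memory

    def forward(self, wx, wh):
        if not self.initialized:
            # One-time preparation: concat wx and wh (the real code also
            # reorders them into the layout the RNN primitive expects).
            self.concat_wx_wh = np.concatenate([wx, wh], axis=0)
            self.initialized = True
        # From the second call on, the cached weights are used directly.
        return self.concat_wx_wh
```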

@pengzhao-intel (Contributor)

@TaoLv please take a review.
@DickJC123 could you help to take a review if our change is fine for GPU?

@pengzhao-intel moved this from In progress to Review in progress in CPU Performance and Quantization May 22, 2019
@pengzhao-intel (Contributor)

The PR is almost done and we're waiting for the local test.
@TaoLv will update the results soon.

@pengzhao-intel changed the title [WIP]MKLDNN RNN Inference Integration(fp32 LSTM and vRNN with tanh and relu) → MKLDNN RNN Inference Integration(fp32 LSTM and vRNN with tanh and relu) May 23, 2019
```c++
hy_ptr,
cy_ptr,
param_.mode);
#if MXNET_USE_MKLDNN == 1 && !defined(__CUDACC__)
```
Contributor:

Is this __CUDACC__ check forgotten, or is it intended to be left here?

Member:

Yes, I think it can be removed also. @zixuanweeei

Contributor:

It has been modified. Thanks.

CPU Performance and Quantization automation moved this from Review in progress to Reviewer approved May 23, 2019
@TaoLv (Member) left a comment:

The integration looks good to me in general. We can revisit the GRU integration and the training part in follow-up PRs.


```c++
cy_ptr,
param_.mode);
#if MXNET_USE_MKLDNN == 1 && !defined(__CUDACC__)
if (dmlc::GetEnv("MXNET_USE_MKLDNN_RNN", 1) && param_.mode != rnn_enum::kGru) {
```
@TaoLv (Member) commented May 23, 2019:

@szha Please review. We add a new environment variable here. Once it's set to 0, the RNN operator will fall back to the original implementation on CPU. Otherwise, the MKL-DNN RNN primitive will be invoked.
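
For instance, a user could opt out of the MKL-DNN path like this (a usage sketch; since the variable is read via dmlc::GetEnv when the operator runs, it just needs to be set before that point):

```python
# Hypothetical usage sketch: disable the new MKL-DNN RNN path so the RNN
# operator falls back to the original CPU implementation. Set the variable
# before MXNet executes the operator (here, before importing mxnet).
import os
os.environ["MXNET_USE_MKLDNN_RNN"] = "0"

import mxnet as mx  # RNN ops now take the original CPU path in this process
```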

@pengzhao-intel (Contributor) left a comment:

In general, it's good even if it's not perfect yet.

Our team will continuously work to improve RNN interface :)

@zixuanweeei (Contributor)

We also tested the performance of our PR against master on GPU. Here is the result. The relative DIFF is calculated as (Our_PR - MASTER) / MASTER. In summary, our modifications do not significantly degrade GPU performance.

Layer = 1, bidirectional = False

| API                 | Our PR (samples/sec) | MASTER (samples/sec) | DIFF (samples/sec) | Relative DIFF |
| ------------------- | -------------------- | -------------------- | ------------------ | ------------- |
| FusedLSTM           | 1038                 | 1058                 | -20                | -1.89%        |
| FusedvRNN with tanh | 1961                 | 1884                 | 77                 | 4.09%         |
| FusedvRNN with relu | 1926                 | 1939                 | -13                | -0.67%        |

Layer = 1, bidirectional = True

| API                 | Our PR (samples/sec) | MASTER (samples/sec) | DIFF (samples/sec) | Relative DIFF |
| ------------------- | -------------------- | -------------------- | ------------------ | ------------- |
| FusedLSTM           | 683                  | 694                  | -11                | -1.59%        |
| FusedvRNN with tanh | 1221                 | 1190                 | 31                 | 2.61%         |
| FusedvRNN with relu | 1212                 | 1209                 | 3                  | 0.25%         |

Layer = 5, bidirectional = False

| API                 | Our PR (samples/sec) | MASTER (samples/sec) | DIFF (samples/sec) | Relative DIFF |
| ------------------- | -------------------- | -------------------- | ------------------ | ------------- |
| FusedLSTM           | 322                  | 320                  | 2                  | 0.63%         |
| FusedvRNN with tanh | 676                  | 649                  | 27                 | 4.16%         |
| FusedvRNN with relu | 670                  | 637                  | 33                 | 5.18%         |

Layer = 5, bidirectional = True

| API                 | Our PR (samples/sec) | MASTER (samples/sec) | DIFF (samples/sec) | Relative DIFF |
| ------------------- | -------------------- | -------------------- | ------------------ | ------------- |
| FusedLSTM           | 110                  | 109                  | 1                  | 0.92%         |
| FusedvRNN with tanh | 210                  | 209                  | 1                  | 0.48%         |
| FusedvRNN with relu | 212                  | 210                  | 2                  | 0.95%         |

@DickJC123 (Contributor)

I think it prudent to resolve the GPU platform issues with rnn-inl.h introduced by commit 1c49e40 before finally accepting this PR (see #15034). Besides introducing test failures of test_rnntanh_bidirectional on P40 GPUs, I have noticed that the codebase no longer compiles against cuDNN versions < 7.0. I intend to submit a PR to resolve both these issues within 24 hours, probably less.

@pengzhao-intel (Contributor)

The PR is ready to be merged.

@szha @DickJC123 do we need to wait for #15056?

@pengzhao-intel (Contributor)

Merging this one first since #15056 is still WIP.

@pengzhao-intel merged commit 653cbb4 into apache:master May 24, 2019
CPU Performance and Quantization automation moved this from Reviewer approved to Done May 24, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…u) (apache#14713)

* trigger the ci

* integrate mkldnn rnn fp32 inference(LSTM and vRNN with tanh and relu)

* fix bug about comparison between signed and unsigned integer expressions

* fix unix-gpu issue

* fix unix gpu bug

* fix unix-gpu issues

* fix some comments

* fix issue

* fix comment

* rename `cached` to `initialized`

* support IType

* TODO for MKLDNN GRU

* fix bugs in memory adjustment

* Reformat TODO for MKLDNN GRU

* Reserve original RNN path

* Remove MKLDNN GRU

* Fix bug for rnn forward

* Remove `__CUDAACC__`

* Move `RNNStatefulComputeCPU` to rnn.cc

* Remove redundent macro of `__CUDACC__`

* Remove the last macro `__CUDACC__` from rnn*