This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MKL-DNN LBR-GRU Inference Integration (FP32 LBR-GRU) #15741

Closed
wants to merge 40 commits into from

Conversation

@zixuanweeei (Contributor) commented Aug 3, 2019

Description

This reopens #15621. We integrated the MKL-DNN Linear-Before-Reset (LBR) GRU into MXNet. Currently, it supports FP32 inference. Please review this PR. @ciyongch @TaoLv @pengzhao-intel

Performance

We tested the performance of FusedRNN with mode='gru' using the same dimensions as in PR #14713, i.e. seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800.

| Mode | Layer | Direction | MXNET_USE_MKLDNN_RNN=0 Throughput (samples/sec) | MXNET_USE_MKLDNN_RNN=0 Latency (ms) | MXNET_USE_MKLDNN_RNN=1 Throughput (samples/sec) | MXNET_USE_MKLDNN_RNN=1 Latency (ms) | Throughput SpeedUp | Latency SpeedUp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gru | 1 | 1 | 430.03 | 20.43 | 806.27 | 4.28 | 1.87 | 4.78 |
| gru | 1 | 2 | 218.58 | 119.50 | 416.55 | 8.58 | 1.91 | 13.93 |
| gru | 5 | 1 | 89.47 | 100.07 | 177.52 | 21.20 | 1.98 | 4.72 |
| gru | 5 | 2 | 39.68 | 611.38 | 71.15 | 46.45 | 1.79 | 13.16 |

We also compared the performance of this PR with that of the previously integrated LSTM, vRNN-Tanh, and vRNN-ReLU on the master branch. It seems that there is no distinct regression, even with mode='lstm'.

| Mode | Layer | Direction | a26af2b (master) Throughput (samples/sec) | a26af2b (master) Latency (ms) | This PR (cfc6910) Throughput (samples/sec) | This PR (cfc6910) Latency (ms) | Throughput Gap | Latency Gap |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lstm | 1 | 1 | 630.78 | 4.82 | 670.23 | 4.87 | 1.06 | 0.99 |
| lstm | 1 | 2 | 313.71 | 9.68 | 338.51 | 9.72 | 1.08 | 1.00 |
| lstm | 5 | 1 | 139.85 | 23.59 | 138.22 | 23.48 | 0.99 | 1.00 |
| lstm | 5 | 2 | 54.63 | 51.19 | 54.27 | 51.28 | 0.99 | 1.00 |
| rnn_tanh | 1 | 1 | 1573.45 | 2.44 | 1576.23 | 2.51 | 1.00 | 0.97 |
| rnn_tanh | 1 | 2 | 836.43 | 4.63 | 830.33 | 4.67 | 0.99 | 0.99 |
| rnn_tanh | 5 | 1 | 381.32 | 11.44 | 379.88 | 11.50 | 1.00 | 1.00 |
| rnn_tanh | 5 | 2 | 159.76 | 24.92 | 149.86 | 24.90 | 0.94 | 1.00 |
| rnn_relu | 1 | 1 | 1536.55 | 2.65 | 1540.29 | 2.75 | 1.00 | 0.96 |
| rnn_relu | 1 | 2 | 805.00 | 5.09 | 807.68 | 5.06 | 1.00 | 1.01 |
| rnn_relu | 5 | 1 | 373.27 | 12.41 | 377.79 | 12.32 | 1.01 | 1.01 |
| rnn_relu | 5 | 2 | 154.21 | 26.93 | 153.80 | 26.61 | 1.00 | 1.01 |

@pengzhao-intel (Contributor)

What's the reason to open a new PR instead of the previous one?

@pengzhao-intel pengzhao-intel added this to In progress in CPU Performance and Quantization via automation Aug 3, 2019
@zixuanweeei (Contributor Author) commented Aug 4, 2019

@pengzhao-intel I incorrectly used git rebase, which introduced all the changes on apache/incubator-mxnet master since the last merged commit into that branch. So I cut a new branch.

@pengzhao-intel (Contributor)

> @pengzhao-intel I incorrectly used git rebase, which introduced all the changes on apache/incubator-mxnet master since the last merged commit into that branch. So I cut a new branch.

Thanks for the explanation.

@pengzhao-intel (Contributor)

Are all comments in the original thread resolved?

CPU Performance and Quantization automation moved this from In progress to Reviewer approved Aug 4, 2019
@pengzhao-intel (Contributor) left a comment

LGTM and will merge tomorrow if there are no other comments.

  #pragma omp parallel for num_threads(omp_threads)
- for (int i = 0; i < I * H; i++) {
+ for (int i = 0; i < input_size * hidden_size; i++) {
Contributor:

Better to move the expression input_size * hidden_size ahead of the for loop.
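
A minimal sketch of the hoisting suggested here (the wrapper function, its parameters, and the name weight_vec_size are illustrative, not from the PR):

```cpp
// Hoist the loop bound so it is evaluated once rather than on every check of
// the loop condition.
void reorder_weights(const float* src, float* dst,
                     int input_size, int hidden_size, int omp_threads) {
  const int weight_vec_size = input_size * hidden_size;
  #pragma omp parallel for num_threads(omp_threads)
  for (int i = 0; i < weight_vec_size; i++) {
    dst[i] = src[i];  // stand-in for the per-element work in the original loop body
  }
}
```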

const int single_cell_size = N * H;
const int single_b_size = ngates * H;
int w_size = (I + H) * H * ngates;
const int cell_size = batch_size * hidden_size;
Contributor:

Change all these sizes from int to size_t?
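
For illustration, a sketch of what the size_t version could look like (the names roughly follow the hunk above; the function wrapper and exact parameter list are assumptions, not code from the PR):

```cpp
#include <cstddef>

// Using size_t (and promoting the first factor) keeps the products from
// overflowing a 32-bit int on large shapes.
void compute_sizes(int batch_size, int input_size, int hidden_size, int ngates) {
  const std::size_t single_cell_size =
      static_cast<std::size_t>(batch_size) * hidden_size;
  const std::size_t single_b_size =
      static_cast<std::size_t>(ngates) * hidden_size;
  const std::size_t w_size =
      (static_cast<std::size_t>(input_size) + hidden_size) * hidden_size * ngates;
  (void)single_cell_size; (void)single_b_size; (void)w_size;  // placeholders
}
```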

mkldnn_mems->hcx_memory[layer_index], mkldnn_mems->wx_memory[layer_index],
mkldnn_mems->wh_memory[layer_index], mkldnn_mems->bias_memory[layer_index],
mkldnn_mems->y_memory[layer_index],
mkldnn_mems->hcy_memory[layer_index], null_memory_);
Contributor:

nit: indent.

if (mode == rnn_enum::kGru) {
const int mx_single_b_sz = ngates * hidden_size;
for (int l = 0; l < num_layer; l++) {
#pragma omp parallel for num_threads(omp_threads)
Contributor:

We could use collapse(2) to parallelize these two for loops together instead of only the inner loop. But note that the Microsoft Visual C++ compiler might not support collapse, so the clause could be guarded by a macro.
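
A sketch of how the collapse(2) clause could be guarded for compilers that do not support it (the function and parameter names are illustrative, not from the PR; MSVC's OpenMP 2.0 support is the assumption behind the guard):

```cpp
// Parallelize both loops together via collapse(2); fall back to parallelizing
// only the outer loop when the compiler does not accept the collapse clause.
void adjust_bias(float* bias, int num_layer, int mx_single_b_sz, int omp_threads) {
#if defined(_MSC_VER)
  #pragma omp parallel for num_threads(omp_threads)
#else
  #pragma omp parallel for num_threads(omp_threads) collapse(2)
#endif
  for (int l = 0; l < num_layer; l++) {
    for (int i = 0; i < mx_single_b_sz; i++) {
      bias[l * mx_single_b_sz + i] += 0.0f;  // stand-in for the real per-element work
    }
  }
}
```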

Contributor Author:

Thanks for noting that.

  if (mode == rnn_enum::kLstm) {
-   for (int l = 0; l < L; l++) {
+   for (int l = 0; l < num_layer; l++) {
      offset1 = l * single_cell_size;
Contributor:

Can you also make offset1 and offset2 more readable?
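
One possible reading of this suggestion, sketched with a purely hypothetical name (what offset2 indexes is not visible in the hunk, so only the first offset is shown):

```cpp
// Replace the generic offset1 with a name that says what it indexes into;
// offset2 would get a similarly descriptive name.
const int hy_layer_offset = l * single_cell_size;  // start of layer l's slice in the hidden-state output
```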

@pengzhao-intel (Contributor)

Could you analyze why the flaky test occurs? Is it from a numerical difference or an algorithm-level difference?

@zixuanweeei (Contributor Author)

I tried to reproduce the failures on our internal GPU platforms, but everything worked well. It should be noted that the source code was compiled with cuda-9.0 and cudnn-9.0-linux-x64-v7.1.2, which are older than the oldest versions tested by CI (cuda-10.x and cudnn-xxx-v7.6).

@pengzhao-intel (Contributor)

@ciyongch please review again :)
@DickJC123 @ptrendx some common code is adjusted to make it work for both CPU and GPU, based on your previous approach. Would you mind taking a look as well?

@pengzhao-intel (Contributor)

cc @szha @eric-haibin-lin

@DickJC123 (Contributor)

Thanks for the heads up about the changes surrounding gpu code. With the reported flakiness, I'd like to have tomorrow (Friday) to investigate.

@zixuanweeei (Contributor Author) commented Aug 9, 2019

@DickJC123 Thanks for your patience. FYI, it seems that the potentially flaky tests take effect with the edited unit tests for the RNN variants. I have tried to modify the code following the instructions from #14476 (review).

Specifically,

  • the temp_space, which is renamed to workspace_, is allocated along with reserve_space_ in Init() at #L1485-L1490.
  • then, host_workspace, renamed to seq_len_space_, is allocated in an if (!init_cudnn_) {...} branch at #L622-L631.

All the spaces above are allocated only once, using ctx.requested[rnn_enum::kTempSpace] instead of Storage. But it didn't work on *NIX systems (CI passed on windows-gpu, while it failed on *NIX-gpu with test_gluon_gpu.test_layer_bidirectional_varseqlength). Though the modifications are not included in this PR, you can find the whole source from this link. I have no idea whether the temp_space and host_workspace should be re-initialized on every iteration. I need your help since I'm not familiar with GPU :)
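
For context, a minimal sketch of the allocation pattern described above, assuming MXNet's usual resource-request API inside the operator's forward path (ctx is the OpContext, DType the element type, and workspace_size a hypothetical element count; this is illustrative, not code from the PR):

```cpp
// Request the temporary workspace from the operator's requested resources
// instead of allocating it separately through Storage.
mshadow::Stream<gpu> *s = ctx.get_stream<gpu>();
mshadow::Tensor<gpu, 1, DType> workspace =
    ctx.requested[rnn_enum::kTempSpace]
        .get_space_typed<gpu, 1, DType>(mshadow::Shape1(workspace_size), s);
```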

@zixuanweeei zixuanweeei requested a review from szha as a code owner August 15, 2019 07:49
@zixuanweeei (Contributor Author)

Cherry-picked from commit 1cf63e1 according to #15847 (comment).

@pengzhao-intel (Contributor)

@TaoLv please review again; I plan to merge after CI passes.

@pengzhao-intel (Contributor)

If it still needs a lot of effort to pass CI, we can drop it and wait for our MKL-DNN 1.0 upgrade.
@zixuanweeei you can make the decision :)

@zixuanweeei
Copy link
Contributor Author

@pengzhao-intel Sure. There are lots of refactor work both on MKL-DNN RNN and naive RNN. At present, MKL-DNN related stuff is under review. Perhaps, we can just drop this PR, and start a new one from current commit on master.

@eric-haibin-lin (Member)

What does Linear-Before-Reset mean?

@TaoLv (Member) commented Sep 2, 2019

> What does Linear-Before-Reset mean?

See the different definitions of c(t) in a vanilla GRU and an LBR GRU: https://intel.github.io/mkl-dnn/dev_guide_rnn.html#Linear-before-reset-GRU
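
For quick reference, the candidate state c(t) is commonly written as follows (a paraphrase of the standard formulation, not copied from the linked guide; W_c and U_c denote the candidate input and recurrent weights, b_c and b_u the biases, r_t the reset gate, and \odot elementwise multiplication):

```latex
% Vanilla GRU: the reset gate is applied to h_{t-1} before the recurrent
% linear transform.
c_t = \tanh\bigl( W_c x_t + U_c (r_t \odot h_{t-1}) + b_c \bigr)

% Linear-before-reset GRU: the recurrent linear transform (with its own bias
% b_u) is computed first, and the reset gate is applied to its result.
c_t = \tanh\bigl( W_c x_t + r_t \odot (U_c h_{t-1} + b_u) + b_c \bigr)
```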

@pengzhao-intel (Contributor)

Closing this PR since we will migrate it with MKL-DNN 1.0.

CPU Performance and Quantization automation moved this from Reviewer approved to Done Sep 15, 2019