This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MKL-DNN LBR-GRU Inference Integration (FP32 LBR-GRU) #15741

Closed
wants to merge 40 commits into from

Conversation

@zixuanweeei (Contributor) commented Aug 3, 2019

Description

This reopens #15621. We integrated the MKL-DNN Linear-Before-Reset (LBR) GRU into MXNet. Currently, it supports FP32 inference. Please review this PR. @ciyongch @TaoLv @pengzhao-intel

Performance

We tested the performance of FusedRNN with mode='gru' using the same dimensions as in PR #14713, i.e. seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800.

| Mode | Layer | Direction | MXNET_USE_MKLDNN_RNN=0 Throughput (samples/sec) | MXNET_USE_MKLDNN_RNN=0 Latency (ms) | MXNET_USE_MKLDNN_RNN=1 Throughput (samples/sec) | MXNET_USE_MKLDNN_RNN=1 Latency (ms) | Throughput SpeedUp | Latency SpeedUp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gru | 1 | 1 | 430.03 | 20.43 | 806.27 | 4.28 | 1.87 | 4.78 |
| gru | 1 | 2 | 218.58 | 119.50 | 416.55 | 8.58 | 1.91 | 13.93 |
| gru | 5 | 1 | 89.47 | 100.07 | 177.52 | 21.20 | 1.98 | 4.72 |
| gru | 5 | 2 | 39.68 | 611.38 | 71.15 | 46.45 | 1.79 | 13.16 |

We also compared the performance of this PR with that of the previously integrated LSTM, vRNN-Tanh, and vRNN-ReLU on the master branch. It seems that there is no distinct regression, even with mode='lstm'.

| Mode | Layer | Direction | a26af2b (master) Throughput (samples/sec) | a26af2b (master) Latency (ms) | This PR (cfc6910) Throughput (samples/sec) | This PR (cfc6910) Latency (ms) | Throughput Gap | Latency Gap |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lstm | 1 | 1 | 630.78 | 4.82 | 670.23 | 4.87 | 1.06 | 0.99 |
| lstm | 1 | 2 | 313.71 | 9.68 | 338.51 | 9.72 | 1.08 | 1.00 |
| lstm | 5 | 1 | 139.85 | 23.59 | 138.22 | 23.48 | 0.99 | 1.00 |
| lstm | 5 | 2 | 54.63 | 51.19 | 54.27 | 51.28 | 0.99 | 1.00 |
| rnn_tanh | 1 | 1 | 1573.45 | 2.44 | 1576.23 | 2.51 | 1.00 | 0.97 |
| rnn_tanh | 1 | 2 | 836.43 | 4.63 | 830.33 | 4.67 | 0.99 | 0.99 |
| rnn_tanh | 5 | 1 | 381.32 | 11.44 | 379.88 | 11.50 | 1.00 | 1.00 |
| rnn_tanh | 5 | 2 | 159.76 | 24.92 | 149.86 | 24.90 | 0.94 | 1.00 |
| rnn_relu | 1 | 1 | 1536.55 | 2.65 | 1540.29 | 2.75 | 1.00 | 0.96 |
| rnn_relu | 1 | 2 | 805.00 | 5.09 | 807.68 | 5.06 | 1.00 | 1.01 |
| rnn_relu | 5 | 1 | 373.27 | 12.41 | 377.79 | 12.32 | 1.01 | 1.01 |
| rnn_relu | 5 | 2 | 154.21 | 26.93 | 153.80 | 26.61 | 1.00 | 1.01 |

@pengzhao-intel (Contributor)

What's the reason to open a new PR instead of the previous one?

@pengzhao-intel pengzhao-intel added this to In progress in CPU Performance and Quantization via automation Aug 3, 2019
@zixuanweeei (Contributor Author) commented Aug 4, 2019

@pengzhao-intel I incorrectly used git rebase, which introduced all the changes on apache/incubator-mxnet master since the last merged commit into that branch. So I cut a new branch.

@pengzhao-intel (Contributor)

> @pengzhao-intel I incorrectly used git rebase, which introduced all the changes on apache/incubator-mxnet master since the last merged commit into that branch. So I cut a new branch.

Thanks for the explanation.

@pengzhao-intel (Contributor)

Are all comments in the original thread resolved?

CPU Performance and Quantization automation moved this from In progress to Reviewer approved Aug 4, 2019
@pengzhao-intel (Contributor) left a comment

LGTM and will merge tomorrow if there are no other comments.

  #pragma omp parallel for num_threads(omp_threads)
- for (int i = 0; i < I * H; i++) {
+ for (int i = 0; i < input_size * hidden_size; i++) {
Contributor:

Better to move the expression input_size * hidden_size ahead of the for loop.
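
A minimal sketch of the hoisting suggested here (the wrapper function, its parameters, and the name weight_vec_size are illustrative, not from the PR):

```cpp
// Hoist the loop bound so it is evaluated once rather than on every check of
// the loop condition.
void reorder_weights(const float* src, float* dst,
                     int input_size, int hidden_size, int omp_threads) {
  const int weight_vec_size = input_size * hidden_size;
  #pragma omp parallel for num_threads(omp_threads)
  for (int i = 0; i < weight_vec_size; i++) {
    dst[i] = src[i];  // stand-in for the per-element work in the original loop body
  }
}
```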

const int single_cell_size = N * H;
const int single_b_size = ngates * H;
int w_size = (I + H) * H * ngates;
const int cell_size = batch_size * hidden_size;
Contributor:

Change all these sizes from int to size_t?
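
For illustration, a sketch of what the size_t version could look like (the names roughly follow the hunk above; the function wrapper and exact parameter list are assumptions, not code from the PR):

```cpp
#include <cstddef>

// Using size_t (and promoting the first factor) keeps the products from
// overflowing a 32-bit int on large shapes.
void compute_sizes(int batch_size, int input_size, int hidden_size, int ngates) {
  const std::size_t single_cell_size =
      static_cast<std::size_t>(batch_size) * hidden_size;
  const std::size_t single_b_size =
      static_cast<std::size_t>(ngates) * hidden_size;
  const std::size_t w_size =
      (static_cast<std::size_t>(input_size) + hidden_size) * hidden_size * ngates;
  (void)single_cell_size; (void)single_b_size; (void)w_size;  // placeholders
}
```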

mkldnn_mems->hcx_memory[layer_index], mkldnn_mems->wx_memory[layer_index],
mkldnn_mems->wh_memory[layer_index], mkldnn_mems->bias_memory[layer_index],
mkldnn_mems->y_memory[layer_index],
mkldnn_mems->hcy_memory[layer_index], null_memory_);
Contributor:

nit: indent.

if (mode == rnn_enum::kGru) {
const int mx_single_b_sz = ngates * hidden_size;
for (int l = 0; l < num_layer; l++) {
#pragma omp parallel for num_threads(omp_threads)
Contributor:

We could use collapse(2) to parallelize these two for loops together instead of only the inner loop. But note that the Microsoft Visual C++ compiler might not support collapse, so the clause could be guarded by a macro.
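
A sketch of how the collapse(2) clause could be guarded for compilers that do not support it (the function and parameter names are illustrative, not from the PR; MSVC's OpenMP 2.0 support is the assumption behind the guard):

```cpp
// Parallelize both loops together via collapse(2); fall back to parallelizing
// only the outer loop when the compiler does not accept the collapse clause.
void adjust_bias(float* bias, int num_layer, int mx_single_b_sz, int omp_threads) {
#if defined(_MSC_VER)
  #pragma omp parallel for num_threads(omp_threads)
#else
  #pragma omp parallel for num_threads(omp_threads) collapse(2)
#endif
  for (int l = 0; l < num_layer; l++) {
    for (int i = 0; i < mx_single_b_sz; i++) {
      bias[l * mx_single_b_sz + i] += 0.0f;  // stand-in for the real per-element work
    }
  }
}
```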

Contributor Author:

Thanks for noting that.

  if (mode == rnn_enum::kLstm) {
-   for (int l = 0; l < L; l++) {
+   for (int l = 0; l < num_layer; l++) {
      offset1 = l * single_cell_size;
Contributor:

Can you also make offset1 and offset2 more readable?
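
One possible reading of this suggestion, sketched with a purely hypothetical name (what offset2 indexes is not visible in the hunk, so only the first offset is shown):

```cpp
// Replace the generic offset1 with a name that says what it indexes into;
// offset2 would get a similarly descriptive name.
const int hy_layer_offset = l * single_cell_size;  // start of layer l's slice in the hidden-state output
```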

@pengzhao-intel (Contributor)

Could you analyze why the flaky test occurs? Is it from a numerical difference or an algorithm-level difference?

@zixuanweeei (Contributor Author)

I tried to reproduce the failures on our internal GPU platforms, but everything worked well. It should be noted that the source code was compiled with cuda-9.0 and cudnn-9.0-linux-x64-v7.1.2, which are older than the oldest versions tested by CI (cuda-10.x and cudnn-xxx-v7.6).

@pengzhao-intel (Contributor)

@ciyongch please review again :)
@DickJC123 @ptrendx some common code is adjusted to make it work for both CPU and GPU, based on your previous approach. Would you mind taking a look as well?

@pengzhao-intel (Contributor)

cc @szha @eric-haibin-lin

@DickJC123 (Contributor)

Thanks for the heads up about the changes surrounding gpu code. With the reported flakiness, I'd like to have tomorrow (Friday) to investigate.

@zixuanweeei (Contributor Author) commented Aug 9, 2019

@DickJC123 Thanks for your patience. FYI, it seems that the potentially flaky tests take effect with the edited unit tests for the RNN variants. I have tried to modify the code following the instructions from #14476 (review).

Specifically,

  • the temp_space, which is renamed to workspace_, is allocated along with reserve_space_ in Init() at #L1485-L1490.
  • then, host_workspace, renamed to seq_len_space_, is allocated in an if (!init_cudnn_) {...} branch at #L622-L631.

All the spaces above are allocated only once, using ctx.requested[rnn_enum::kTempSpace] instead of Storage. But it didn't work on *NIX systems (CI passed on windows-gpu, while it failed on *NIX-gpu with test_gluon_gpu.test_layer_bidirectional_varseqlength). Though the modifications are not included in this PR, you can find the whole source from this link. I have no idea whether the temp_space and host_workspace should be re-initialized on every iteration. I need your help since I'm not familiar with GPU :)
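
For context, a minimal sketch of the allocation pattern described above, assuming MXNet's usual resource-request API inside the operator's forward path (ctx is the OpContext, DType the element type, and workspace_size a hypothetical element count; this is illustrative, not code from the PR):

```cpp
// Request the temporary workspace from the operator's requested resources
// instead of allocating it separately through Storage.
mshadow::Stream<gpu> *s = ctx.get_stream<gpu>();
mshadow::Tensor<gpu, 1, DType> workspace =
    ctx.requested[rnn_enum::kTempSpace]
        .get_space_typed<gpu, 1, DType>(mshadow::Shape1(workspace_size), s);
```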

@zixuanweeei zixuanweeei requested a review from szha as a code owner August 15, 2019 07:49
@zixuanweeei (Contributor Author)

Cherry-picked from commit 1cf63e1 according to #15847 (comment).

@pengzhao-intel (Contributor)

@TaoLv please review again; I plan to merge after CI passes.

@pengzhao-intel (Contributor)

If it still needs a lot of effort to pass CI, we can drop it and wait for our MKL-DNN 1.0 upgrade.
@zixuanweeei you can make the decision :)

@zixuanweeei
Copy link
Contributor Author

@pengzhao-intel Sure. There are lots of refactor work both on MKL-DNN RNN and naive RNN. At present, MKL-DNN related stuff is under review. Perhaps, we can just drop this PR, and start a new one from current commit on master.

@eric-haibin-lin (Member)

What does Linear-Before-Reset mean?

@TaoLv (Member) commented Sep 2, 2019

> What does Linear-Before-Reset mean?

See the different definitions of c(t) in a vanilla GRU and an LBR GRU: https://intel.github.io/mkl-dnn/dev_guide_rnn.html#Linear-before-reset-GRU
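
For quick reference, the candidate state c(t) is commonly written as follows (a paraphrase of the standard formulation, not copied from the linked guide; W_c and U_c denote the candidate input and recurrent weights, b_c and b_u the biases, r_t the reset gate, and \odot elementwise multiplication):

```latex
% Vanilla GRU: the reset gate is applied to h_{t-1} before the recurrent
% linear transform.
c_t = \tanh\bigl( W_c x_t + U_c (r_t \odot h_{t-1}) + b_c \bigr)

% Linear-before-reset GRU: the recurrent linear transform (with its own bias
% b_u) is computed first, and the reset gate is applied to its result.
c_t = \tanh\bigl( W_c x_t + r_t \odot (U_c h_{t-1} + b_u) + b_c \bigr)
```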

@pengzhao-intel (Contributor)

Closing this PR since we will migrate it with MKL-DNN 1.0.

CPU Performance and Quantization automation moved this from Reviewer approved to Done Sep 15, 2019