This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[WIP][MXNET-107] Fused LSTM implementation for CPU #10104

Merged
merged 39 commits into from May 14, 2018

Conversation


@chenchu-zs chenchu-zs commented Mar 14, 2018

Description

In this PR, a fused LSTM operator for CPU is implemented. Support for other RNN variants is in progress and will be submitted in separate PRs.

Feature changes

New features

  • Fused LSTM implementation, including both forward and backward computation.
  • Shares the same frontend interface as the current sym.RNN operator (see the usage sketch below).
  • Shares the same algorithm and input layout as the current sym.RNN operator.
  • Refactors the code and registers it with the NNVM interfaces.
  • Supports both FP32 and FP64 inputs; more types, such as int8, will be supported later.
  • Provides more extensible APIs for other RNN variants (vanilla RNN/GRU).
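
Usage sketch (not code from the PR itself): the fused kernels are reached through the existing sym.RNN frontend with mode='lstm'. The shapes below follow the DS2 sizes used in the benchmark, and the flat parameter size is an assumption based on the standard 4-gate LSTM packing.

```python
import mxnet as mx

T, N, I, H = 300, 20, 800, 800                     # seq_len, batch, input, hidden
num_params = 4 * H * (I + H + 2)                   # assumed single-layer, unidirectional packing

lstm = mx.sym.RNN(data=mx.sym.Variable('data'),
                  parameters=mx.sym.Variable('params'),
                  state=mx.sym.Variable('state'),
                  state_cell=mx.sym.Variable('state_cell'),
                  state_size=H, num_layers=1, mode='lstm',
                  state_outputs=True, name='lstm')

# Bind on CPU: with this PR the forward pass runs through the fused kernels.
exe = lstm.simple_bind(ctx=mx.cpu(), data=(T, N, I), params=(num_params,),
                       state=(1, N, H), state_cell=(1, N, H))
exe.forward(is_train=False)
print(exe.outputs[0].shape)                        # (T, N, H)
```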

Unit-test changes

  • Create new tests test_lstm and test_lstm_bidirectional in tests/python/unittest/test_operator.py.
  • Check consistency with the original LSTMCell implementation.

Performance

We tested the performance of sym.RNN and rnn.LSTMCell on a local Skylake-8180 machine (2 sockets, 56 cores), using MKL as the BLAS library. The test input sizes come from the DS2 default parameters (seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800). A rough timing sketch follows the tables below.

single-layer measurement:

| API | Inference time (fwd, sec) | Training time (fwd + bwd, sec) |
| --- | --- | --- |
| rnn.LSTMCell | 0.106902 | 0.273108 |
| #9977 | 0.050126 | --- |
| this PR | 0.050668 | 0.130266 |
| speedup | 2.1x | 2.1x |

multi-layer measurement (num_layers = 5):

| API | Inference time (fwd, sec) | Training time (fwd + bwd, sec) |
| --- | --- | --- |
| rnn.LSTMCell | 0.532034 | 1.546486 |
| sym.RNN (#9977) | 0.18641 | --- |
| sym.RNN (this PR) | 0.190032 | 0.619439 |
| rnn.LSTMCell (cuda) | 0.231355 | 0.785780 |
| sym.RNN (cudnn) | 0.060647 | 0.161115 |
| speedup: #10104 / LSTMCell | 285.41% | 249.66% |
| speedup: #10104 / LSTMCell (cuda) | 124.09% | 126.85% |
| speedup: #10104 / sym.RNN (cudnn) | 32.53% | 26.01% |
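
For reference, a rough way to reproduce this kind of forward-time measurement; the PR does not include the benchmarking script, so the warm-up, iteration count, and initialization below are assumptions.

```python
import time
import mxnet as mx

T, N, I, H = 300, 20, 800, 800                     # DS2 default sizes from above
lstm = mx.sym.RNN(data=mx.sym.Variable('data'), state_size=H, num_layers=1,
                  mode='lstm', name='lstm')
exe = lstm.simple_bind(ctx=mx.cpu(), data=(T, N, I))
for arr in exe.arg_arrays:
    arr[:] = 0.01                                  # initialize data and parameters

exe.forward(is_train=False)[0].wait_to_read()      # warm-up
runs = 10
start = time.time()
for _ in range(runs):
    exe.forward(is_train=False)[0].wait_to_read()
print('fwd time per iteration: %.6f sec' % ((time.time() - start) / runs))
```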

Opens

  • Fix cuDNN registration in this PR
  • Add multi-layer and bidirectional support for LSTM
  • Support Gluon interfaces
  • Fix NNVM registration
  • Other RNN variants (will be added in other PRs)
  • Add dropout support (in other PRs)

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change


TaoLv commented Mar 14, 2018

@szha @Jerryzcn @eric-haibin-lin @pengzhao-intel Could you help review this PR? We need cooperation to refactor the cuDNN registration.

size_t size = 0;
switch (mode) {
case rnn_enum::kRnnRelu:
break;
Member

Need error message for unimplemented modes.

size = (seq_length + 1) * batch_size * hidden_size * 4 + batch_size * hidden_size;
break;
case rnn_enum::kGru:
break;
Member

Add a default statement for code robustness.

w_ptr, y_ptr, hy_ptr, cy_ptr);
break;
case rnn_enum::kGru:
break;
Member

Also needs an error message for unimplemented modes and a default statement for the switch-case.

@@ -19,40 +19,214 @@

/*!
* Copyright (c) 2015 by Contributors
* \file rnn.cc
* \file rnn.cc
Member

remove this change

* \brief
* \author Sebastian Bodenstein
* \author Sebastian Bodenstein, Shu Zhang(shu.zhang@intel.com)
Member

remove whitespace

}
}
static inline int NumVisibleOutputs(const NodeAttrs& attrs) {
const RNNParam& params = nnvm::get<RNNParam>(attrs.parsed);
Member

fix indents in this function.

MXNET_REGISTER_OP_PROPERTY(RNN, RNNProp)
.describe("Applies a recurrent layer to input.")
inline static bool RNNStorageType(const nnvm::NodeAttrs& attrs,
const int dev_mask,
Member

fix indent. Align with the first parameter.


inline static bool BackwardRNNStorageType(const nnvm::NodeAttrs& attrs,
const int dev_mask,
DispatchMode* dispatch_mode,
Member

fix indent. Align with the first parameter.


TaoLv commented Mar 14, 2018

It seems the collapse clause of omp parallel for is not supported on Windows.

wh = mx.random.uniform(-1, 1, (4 * H, H), ctx=xpu,dtype=type1)
bx = mx.nd.zeros((4 * H,), ctx=xpu, dtype=type1)
bh = mx.nd.zeros((4 * H,), ctx=xpu, dtype=type1)
x1.attach_grad()
Member

why do you need to manually attach grad??

Contributor Author

attach_grad is used to create gradient buffers for these NDArrays here. Do you mean this can be implemented in another way, or do you have any suggestion about this piece of code?
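
For context, a minimal illustration (assumed, not code from this PR) of what attach_grad does here: it allocates the gradient buffer that autograd.backward later writes into.

```python
import mxnet as mx
from mxnet import autograd

x = mx.nd.random.uniform(-1, 1, (4, 4))
x.attach_grad()                  # allocate a gradient buffer for x
with autograd.record():
    y = (x * x).sum()
y.backward()                     # the gradient is written into x.grad (here 2 * x)
print(x.grad)
```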

Contributor

In case we use a stateful OP, what's your opinion @eric-haibin-lin?

@@ -1540,6 +1548,7 @@ def check_rnn_layer_w_rand_inputs(layer):
for g, c in zip(gs, cs):
assert_almost_equal(g.asnumpy(), c.asnumpy(), rtol=1e-2, atol=1e-6)

@unittest.skip("test fails intermittently. temporarily disabled till it gets fixed.")
Member

Why is it failing?

Member

If USE_CUDNN=1, I think this test will run into the cudnn implementation, which has been disabled temporarily. We will re-enable this test case after we add cudnn back. In fact, the GPU build is currently failing; we are working on it.

};

NNVM_REGISTER_OP(RNN)
.describe(R"code(Applies a recurrent layer to input
Member

Please provide more detailed descriptions

Member

okay, will do.

DType* reserve_space_ptr = out_data[out_expected - 1].dptr<DType>();

// allocate temp space
size_t workspace_size = GetRNNWorkspaceSize(param_.seq_length_, param_.batch_size_,
Member

nit: const

Tensor<cpu, 1, DType> workspace = ctx.requested[rnn_enum::kTempSpace]
.get_space_typed<cpu, 1, DType>(Shape1(workspace_size), s);

int direction = param_.bidirectional ? 2 : 1;
Member

nit: const

@Jerryzcn
Contributor

I used

for (int64_t ji = 0; ji < length; ++ji) {
      int64_t j = ji / h_channel;  // batch dim
      int64_t i = ji % h_channel;

to replace collapse


TaoLv commented Mar 15, 2018

@Jerryzcn Good suggestion. I will take a try.


TaoLv commented Mar 15, 2018

BTW, is there any existing JIRA issue for the RNN implementation? Do I need to create a JIRA issue for this PR? @eric-haibin-lin @Jerryzcn @szha

@eric-haibin-lin
Member

pls create one

@chenchu-zs chenchu-zs changed the title [WIP] Fused RNN implementation for CPU [WIP][MXNET-107] Fused RNN implementation for CPU Mar 15, 2018
@Jerryzcn
Contributor

Should we create a separate branch for the CPU RNN? Once all the changes are checked in, we merge the RNN branch into master. This way master won't break people's code.

@piiswrong
Contributor

@Jerryzcn good idea. @eric-haibin-lin please open a branch

@szha szha changed the base branch from master to cpu_fused_rnn March 16, 2018 07:14

szha commented Mar 16, 2018

Happened to be around on GitHub. I created the branch cpu_fused_rnn and updated the PR base.


TaoLv commented Mar 16, 2018

Thanks @szha, will keep working on this.

@pengzhao-intel
Contributor

@sherry-zhang Good job!
Please update this PR's description for the multi-layer and bidirectional functionality 👍

@chenchu-zs chenchu-zs force-pushed the rnn_refactor branch 2 times, most recently from 1ee9ee3 to dde8e23 Compare March 22, 2018 02:40

TaoLv commented Mar 22, 2018

@marcoabreu I am working on branch cpu_fused_rnn, but CI fails in the sanity check. I suspect the CI environment has been adjusted for the master branch, so this branch cannot work properly. Could you help take a look? Thanks.

The pylint check passes on my local server but fails in the sanity check:

Makefile:479: recipe for target 'pylint' failed
make: *** [pylint] Error 22
build.py: 2018-03-22 04:55:51,746 Running of command in container failed: docker run --rm -v /home/jenkins_slave/workspace/sanity:/work/mxnet -v /home/jenkins_slave/workspace/sanity/build:/work/build -u 1001:1001 mxnet/build.ubuntu_cpu /work/runtime_functions.sh sanity_check

build.py: 2018-03-22 04:55:51,746 You can try to get into the container by using the following command: docker run --rm -v /home/jenkins_slave/workspace/sanity:/work/mxnet -v /home/jenkins_slave/workspace/sanity/build:/work/build -u 1001:1001 -ti --entrypoint bash mxnet/build.ubuntu_cpu /work/runtime_functions.sh sanity_check

Traceback (most recent call last):
  File "ci/build.py", line 179, in <module>
    sys.exit(main())
  File "ci/build.py", line 159, in main
    container_run(platform, docker_binary, command)
  File "ci/build.py", line 110, in container_run
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command 'docker run --rm -v /home/jenkins_slave/workspace/sanity:/work/mxnet -v /home/jenkins_slave/workspace/sanity/build:/work/build -u 1001:1001 mxnet/build.ubuntu_cpu /work/runtime_functions.sh sanity_check' returned non-zero exit status 2

script returned exit code 1

@piiswrong
Contributor

Looks like this is not going to make it into 1.2.
Can we revert the LSTM forward part that's already merged into master so that we don't ship a half-baked feature?


szha commented Mar 22, 2018

@piiswrong the merged RNN feature supports an inference-only LSTM that is compatible with the cudnn implementation. The Gluon LSTM layer now supports inference-only forwarding with this feature, and the rest of the use cases are still on the old code paths, thanks to @Jerryzcn. The merged PR does what it sets out to do better than what previously existed, so it's more than half-baked.


szha commented Mar 22, 2018

cc @zhiheng-huang as his team will likely be impacted by the decision of reverting Jerry's PR.


marcoabreu commented Mar 22, 2018

For now, please fork the master branch in your own repository and let collaborators make PRs towards your repository. At the same time, create a PR from your fork towards the master branch to have constant feedback every time a commit to your fork is being made.


TaoLv commented Mar 22, 2018

@szha @piiswrong May I have your opinions? I don't have permission to create/delete branches and redirect this PR to the master branch. I can rebase the code onto master if needed.

@marcoabreu marcoabreu changed the base branch from cpu_fused_rnn to master March 22, 2018 14:06
@marcoabreu
Contributor

I have changed the base branch as requested. We are currently having an internal discussion about whether we support feature branches in the official repository; until then, it is better to work against master to ensure your PR always receives the latest updates. I have also retriggered CI.


TaoLv commented Mar 22, 2018

Thanks, @marcoabreu! Really understand your concern. Will keep working on this PR.

@marcoabreu
Contributor

Thanks a lot! Please excuse the inconvenience - in case of further problems, feel free to ping me again.


TaoLv commented May 8, 2018

I find it difficult to change the existing Gluon LSTM layer from a normal Block to a HybridBlock without changing its APIs.
(1) I need to concatenate the existing i2h_weight, h2h_weight, i2h_bias and h2h_bias to feed them into the fused operator, which I think is time consuming (see the sketch below). link
(2) I cannot create begin_state if it is not provided in the hybrid_forward function, since I cannot get the shape and batch size inside a HybridBlock. link
Maybe I missed something. Any cues about it? @szha @piiswrong
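
For reference, a sketch of the concatenation mentioned in point (1). The flat layout (weights first, then biases) and the helper name fuse_lstm_params are assumptions following the cuDNN-style packing that sym.RNN expects, not code from this PR.

```python
import mxnet as mx

def fuse_lstm_params(i2h_weight, h2h_weight, i2h_bias, h2h_bias):
    """Pack separate per-gate LSTM parameters into one flat vector for sym.RNN."""
    return mx.nd.concat(i2h_weight.reshape((-1,)), h2h_weight.reshape((-1,)),
                        i2h_bias, h2h_bias, dim=0)

I, H = 8, 16
fused = fuse_lstm_params(mx.nd.zeros((4 * H, I)), mx.nd.zeros((4 * H, H)),
                         mx.nd.zeros((4 * H,)), mx.nd.zeros((4 * H,)))
assert fused.shape == (4 * H * (I + H + 2),)       # single layer, unidirectional
```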

@piiswrong
Contributor

@TaoLv I'll look into this later. I think you can do it similarly to RNNCell.
Let's merge the backend LSTM implementation first. Are there enough tests?

out = exe.forward(is_train=False)
out[0].wait_to_read()
assert False # should not reach here
except mx.base.MXNetError as err:
Contributor

Excellent approach! This will ensure we don't forget to re-enable the test when we introduce dropout. Great job.

Member
@TaoLv TaoLv May 9, 2018

Yes. Also to ensure the failure happens at the proper position and the correct error message is presented, following @reminisce 's idea in PR 10844.
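
A hedged sketch of that test pattern (shapes, dropout value, and the test name are illustrative, not the exact test code): request an option the operator does not support yet and assert that execution raises MXNetError.

```python
import mxnet as mx

def test_lstm_dropout_not_supported():
    T, N, I, H = 3, 2, 4, 4
    sym = mx.sym.RNN(data=mx.sym.Variable('data'), state_size=H, num_layers=2,
                     mode='lstm', p=0.5, name='lstm')   # dropout requested
    try:
        exe = sym.simple_bind(ctx=mx.cpu(), data=(T, N, I))
        exe.forward(is_train=False)
        exe.outputs[0].wait_to_read()
        assert False                                    # should not reach here
    except mx.base.MXNetError:
        pass                                            # expected until dropout is implemented
```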


TaoLv commented May 9, 2018

@piiswrong I added a test for dropout. I think this LSTM operator is good to merge. Dropout support and the hybrid RNN layer are WIP and will be submitted in another PR. I will also rebase #10311 accordingly.
If you have any ideas or a design conception for the hybrid RNN layer, please let me know.

@szha szha added this to In progress in gluon.rnn improvements May 9, 2018
@szha szha changed the title [WIP][MXNET-107] Fused RNN implementation for CPU [WIP][MXNET-107] Fused LSTM implementation for CPU May 9, 2018
@@ -0,0 +1,454 @@
/*
Contributor

we don't use hpp. Please rename to .h

Member

fixed

for (int i = 0; i < T; ++i) {
int t = bid ? T - 1 - i : i;
linalg_gemm(i ? h : hx, wh, yh_flat, alpha, beta, false, true);
#pragma omp parallel for
Contributor

Member

fixed

}
}
}
memcpy(y_ptr, rs + y_offset, T * N * H * D * sizeof(DType));
Contributor

why is this copy needed?

Member

One copy is for the forward output and the other is kept for reuse in the backward computation.

for (int i = 0; i < T; ++i) {
int t = bid ? T - 1 - i : i;
linalg_gemm(i ? h : hx, wh, yh_flat, alpha, beta, false, true);
#pragma omp parallel for
Contributor

same

Member

fixed

const Tensor<cpu, 2, DType>& dcnext = i ? dc : dcx;
const Tensor<cpu, 2, DType>& hnext = i ? htmp : hx;
const Tensor<cpu, 2, DType>& cnext = i ? c[i - 1] : cx;
#pragma omp parallel for
Contributor

same

Member

fixed

const int row = T * N;
const int col = H * 4;
for (int i = 0; i < row; ++i) {
#pragma omp parallel for
Contributor

same

Contributor

omp usage may not be efficient here. The operations in this loop are very simple, while col is usually less than a few thousand.

Member

You are right. I will remove this omp temporarily and look for a better optimization for this piece of code.

const DType beta1 = 1.0;
const int cell_size = N * H;
if (dhy_ptr != NULL) {
memcpy(dh.dptr_, dhy_ptr, cell_size * sizeof(DType));
Contributor

why are these copies needed?

data = mx.sym.Variable('data')

Y1, _ = cell1.unroll(T, data, layout='NTC', merge_outputs=True)
mod1 = mx.mod.Module(Y1, label_names=None, context=mx.cpu())
Contributor

use default_context() here and remove the corresponding tests in test_operator_gpu. These tests will automatically be run again in test_operator_gpu with default_context() = gpu()

Member

Fixed. Also, I changed the name of test_lstm to test_lstm_sym since it would conflict with the one in /unittest/test_gluon_rnn.py after being imported into test_operator_gpu.py.
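
For reference, a small sketch of the reviewer's point (the test body is illustrative): writing the test against default_context() lets test_operator_gpu.py re-run it on GPU simply by importing it.

```python
import mxnet as mx
from mxnet.test_utils import default_context

def test_lstm_sym():
    ctx = default_context()          # cpu() here; gpu() when re-imported by test_operator_gpu.py
    sym = mx.sym.RNN(data=mx.sym.Variable('data'), state_size=4, num_layers=1,
                     mode='lstm', name='lstm')
    exe = sym.simple_bind(ctx=ctx, data=(3, 2, 4))
    exe.forward(is_train=False)
    exe.outputs[0].wait_to_read()
```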

@pengzhao-intel
Contributor

@eric-haibin-lin @piiswrong @szha @Jerryzcn the comments are addressed. Please help take another look.
After this PR is merged, we can rebase the GRU PR and add dropout to LSTM/GRU soon.

DType ft = ifgo[i][j][k][1];
DType gt = ifgo[i][j][k][2];
DType ot = ifgo[i][j][k][3];
dh[j][k] += dy[t][j][k + offset];
Contributor

dh and dc are never read before they are overwritten. Why do you need the copy at line 341?

}
}
}
memcpy(y_ptr, rs + y_offset, T * N * H * D * sizeof(DType));
Contributor

why not write to y_ptr directly for the last layer?

}
Tensor<cpu, 2, DType> dyh(difgo[t].dptr_, Shape2(N, H * 4));
linalg_gemm(dyh, wh, dhnext, alpha, beta0, false, false);
linalg_gemm(dyh, hnext, dwh, alpha, beta1, true, false);
Contributor

dwh is overwritten. why do you need to set it to 0 with memset at 328?

@piiswrong
Contributor

I can merge this first. But I think the memset and memcpy statements are superfluous. We should get rid of them later.


@piiswrong piiswrong merged commit 275378a into apache:master May 14, 2018
gluon.rnn improvements automation moved this from In progress to Done May 14, 2018
@szha szha mentioned this pull request May 15, 2018
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request May 29, 2018
* register RNN fused-API with nnvm, finish single-layer && undirection LSTM forward function

* fix coding style and lint complains

* add single-layer && undirectional LSTM backward function

* make interface universal for other RNN mode

* share intermediate result between forward and backward in a trick way

* add comments for important parameters

* modify testcase

* Fix coding style and error message

* fix openmp collapse error

* fix const

* remove rnn.cu and skip related testcases temporarily for building on GPU

* support multi-layer and bidirectional for lstm inference

* remove some testcaseS in test_gluon_rnn.py to build on GPU

* remove testcase between fp32 and fp64 temporarily

* retrigger ci

* fix some logs

* use a better way to share memory

* fix cudnn registration

* fix invariant calculations and enable some gpu testcases

* add thread local cache for cudnn rnn op

* add thread local cache for rnn op

* fix bugs

* remove some testcases to check segmentfault

* remove cudnn registeration to check segmentfault

* support multi-layer for LSTM Training

* modify lstm testcase

* add bidirectional support for lstm

* fix gluon and coding style

* fix bugs

* remove nnvm registration

* enable gpu testcases

* add detailed descriptions

* add dropout check

* fix workspace size

* dropout is not supported, add unit test for it

* fix review comments
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
@Roshrini Roshrini mentioned this pull request Aug 8, 2018