This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Backward doesn't work on LSTM with sequence_length #15268

Closed
Ishitori opened this issue Jun 18, 2019 · 8 comments

Comments

@Ishitori
Contributor

Description

LSTM with out-of-the-box variable-length support was introduced in this PR. I tried to use it, and while the forward pass works well, the backward pass fails.

I provide a minimum reproducible example below. To the best of my knowledge, the backward pass is not covered by a unit test.

Environment info (Required)

The latest nightly version, installed with --pre

Package used (Python/R/Scala/Julia):
Python

Error Message:

MXNetError: [17:18:04] src/operator/./rnn-inl.h:1006: Check failed: in_data.size() == num_inputs (4 vs. 5) : 
Stack trace:
  [bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4a157b) [0x7fdedb45957b]
  [bt] (1) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x507b9ad) [0x7fdee00339ad]
  [bt] (2) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x50b5cac) [0x7fdee006dcac]
  [bt] (3) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x396) [0x7fdedd6b3d36]
  [bt] (4) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext)+0x5d) [0x7fdedd6b43cd]
  [bt] (5) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x264c4f9) [0x7fdedd6044f9]
  [bt] (6) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2658961) [0x7fdedd610961]
  [bt] (7) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x265be70) [0x7fdedd613e70]
  [bt] (8) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x265c106) [0x7fdedd614106]

Minimum reproducible example

You have to use a GPU to run it, as this feature is GPU-only.
Note that backward() fails silently; the mx.nd.waitall() call at the end is necessary to surface the error.

import mxnet as mx
import numpy as np
from mxnet.gluon import nn
from mxnet.gluon.rnn import LSTM

ctx = mx.gpu(0)

label = mx.nd.array([1, 2, 3, 4, 5, 6, 7], ctx=ctx)
# random numbers, but with ones at the end as a padding symbol
x = mx.nd.array([[5434, 3232, 776, 323, 1, 1, 1], [4353, 545, 37, 23, 23, 545, 1]], ctx=ctx)

embedding = nn.Embedding(input_dim=6000,
                         output_dim=100,
                         weight_initializer=mx.initializer.Uniform(0.001))

lstm = LSTM(hidden_size=100,
            num_layers=1, dropout=0.2, bidirectional=True,
            use_sequence_length=True)

dense = nn.Dense(1)
l1 = mx.gluon.loss.L1Loss()

embedding.initialize(ctx=ctx)
lstm.initialize(ctx=ctx)
dense.initialize(ctx=ctx)


with mx.autograd.record():
    x_mask = x != 1
    x_len = mx.nd.sum(x_mask, axis=1).astype(np.int32)    
    state = lstm.begin_state(batch_size=x.shape[0], ctx=x.context)
    x_emb = embedding(x)
    x_emb = x_emb.transpose((1, 0, 2))
    a, _ = lstm(x_emb, states=state, sequence_length=x_len)
    out = dense(a)
    loss = l1(out, label)
    # this prints the loss, showing that forward pass works fine
    print(loss)

# this one will fail
loss.backward()
mx.nd.waitall()
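As a side note, the mask and length computation in the script above can be sanity-checked on CPU with plain NumPy. The sketch below is only an illustration (not part of the original repro), and it assumes, as in the example, that token id 1 is the padding symbol:

```python
import numpy as np

# Same inputs as the MXNet example; id 1 marks padding.
x = np.array([[5434, 3232, 776, 323, 1, 1, 1],
              [4353, 545, 37, 23, 23, 545, 1]])

# Boolean mask of non-padding positions, then per-row sequence lengths,
# mirroring `x != 1` and `mx.nd.sum(x_mask, axis=1)` from the script.
x_mask = (x != 1)
x_len = x_mask.sum(axis=1).astype(np.int32)

print(x_len)  # -> [4 6]
```

This convention only works if id 1 never occurs as a real (non-padding) token; otherwise the computed lengths would be too short.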
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@Ishitori
Contributor Author

@stephenrawls, any help with that?

@leleamol
Contributor

@mxnet-label-bot add [Bug, Gluon]

@leleamol
Contributor

I could reproduce this issue. Here is the full call stack.

ubuntu@ip-172-31-31-181:~$ python lstm_test.py

[1.0000387 2.0000887 3.000114 4.000115 5.000068 6.0000405 7. ]
<NDArray 7 @gpu(0)>
Traceback (most recent call last):
  File "lstm_test.py", line 42, in <module>
    mx.nd.waitall()
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 166, in waitall
    check_call(_LIB.MXNDArrayWaitAll())
  File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [22:04:02] src/operator/./rnn-inl.h:1006: Check failed: in_data.size() == num_inputs (4 vs. 5) :
Stack trace:
  [bt] (0) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7ffa222a1082]
  [bt] (1) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::RNNOp<mshadow::gpu, float, float>::Backward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x1dc) [0x7ffa26c335bc]
  [bt] (2) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::RNNStatefulGradCompute<mshadow::gpu>(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x21b6) [0x7ffa26c68186]
  [bt] (3) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x1333) [0x7ffa24680473]
  [bt] (4) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x1d) [0x7ffa2468173d]
  [bt] (5) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::BulkFlush()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&)+0x1ec) [0x7ffa24e3f3dc]
  [bt] (6) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x945) [0x7ffa24e42c35]
  [bt] (7) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x11d) [0x7ffa24e5a64d]
  [bt] (8) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7ffa24e5a8fe]

@stephenrawls
Contributor

Looking at this now.

@stephenrawls
Contributor

I think I have a solution; at least, I have tested it locally and it appears to work.

I just need to update the unit tests, and then I will file a PR.

@roywei
Member

roywei commented Jun 19, 2019

Thanks @Ishitori for the catch!
Hi @stephenrawls, could you tag @szha and me in your PR? We would like to include your fix in the MXNet 1.5.0 release, since this feature is already included in 1.5.0.rc1.
Thanks!

@leleamol
Contributor

I verified that the issue is fixed in the latest code.

@lanking520 I would recommend closing this issue.

@szha szha closed this as completed Jun 21, 2019