This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Backward doesn't work on LSTM with sequence_length #15268

Closed
Ishitori opened this issue Jun 18, 2019 · 8 comments

Comments

@Ishitori
Contributor

Description

LSTM with out-of-the-box variable-length support was introduced in this PR. I tried to use it, and while the forward pass works well, the backward pass fails.

I provide a minimum reproducible example below. To the best of my knowledge, the backward pass is not covered by a unit test.

Environment info (Required)

The latest nightly version, installed with --pre

Package used (Python/R/Scala/Julia):
Python

Error Message:

MXNetError: [17:18:04] src/operator/./rnn-inl.h:1006: Check failed: in_data.size() == num_inputs (4 vs. 5) : 
Stack trace:
  [bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4a157b) [0x7fdedb45957b]
  [bt] (1) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x507b9ad) [0x7fdee00339ad]
  [bt] (2) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x50b5cac) [0x7fdee006dcac]
  [bt] (3) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x396) [0x7fdedd6b3d36]
  [bt] (4) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext)+0x5d) [0x7fdedd6b43cd]
  [bt] (5) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x264c4f9) [0x7fdedd6044f9]
  [bt] (6) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2658961) [0x7fdedd610961]
  [bt] (7) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x265be70) [0x7fdedd613e70]
  [bt] (8) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x265c106) [0x7fdedd614106]

Minimum reproducible example

You have to use a GPU to run it, as this feature is GPU-only.
Note that backward() fails silently; the mx.nd.waitall() call at the end is necessary to surface the error.

import mxnet as mx
import numpy as np
from mxnet.gluon import nn
from mxnet.gluon.rnn import LSTM

ctx = mx.gpu(0)

label = mx.nd.array([1, 2, 3, 4, 5, 6, 7], ctx=ctx)
# random numbers, but with ones at the end as a padding symbol
x = mx.nd.array([[5434, 3232, 776, 323, 1, 1, 1], [4353, 545, 37, 23, 23, 545, 1]], ctx=ctx)

embedding = nn.Embedding(input_dim=6000,
                         output_dim=100,
                         weight_initializer=mx.initializer.Uniform(0.001))

lstm = LSTM(hidden_size=100,
            num_layers=1, dropout=0.2, bidirectional=True,
            use_sequence_length=True)

dense = nn.Dense(1)
l1 = mx.gluon.loss.L1Loss()

embedding.initialize(ctx=ctx)
lstm.initialize(ctx=ctx)
dense.initialize(ctx=ctx)


with mx.autograd.record():
    x_mask = x != 1
    x_len = mx.nd.sum(x_mask, axis=1).astype(np.int32)    
    state = lstm.begin_state(batch_size=x.shape[0], ctx=x.context)
    x_emb = embedding(x)
    x_emb = x_emb.transpose((1, 0, 2))
    a, _ = lstm(x_emb, states=state, sequence_length=x_len)
    out = dense(a)
    loss = l1(out, label)
    # this prints the loss, showing that forward pass works fine
    print(loss)

# this one will fail
loss.backward()
mx.nd.waitall()
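As a side note, the mask and length computation in the script above can be sanity-checked on CPU with plain NumPy. The sketch below is only an illustration (not part of the original repro), and it assumes, as in the example, that token id 1 is the padding symbol:

```python
import numpy as np

# Same inputs as the MXNet example; id 1 marks padding.
x = np.array([[5434, 3232, 776, 323, 1, 1, 1],
              [4353, 545, 37, 23, 23, 545, 1]])

# Boolean mask of non-padding positions, then per-row sequence lengths,
# mirroring `x != 1` and `mx.nd.sum(x_mask, axis=1)` from the script.
x_mask = (x != 1)
x_len = x_mask.sum(axis=1).astype(np.int32)

print(x_len)  # -> [4 6]
```

This convention only works if id 1 never occurs as a real (non-padding) token; otherwise the computed lengths would be too short.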
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@Ishitori
Contributor Author

@stephenrawls, any help with that?

@leleamol
Contributor

@mxnet-label-bot add [Bug, Gluon]

@leleamol
Contributor

I could reproduce this issue. Here is the full call stack.

ubuntu@ip-172-31-31-181:~$ python lstm_test.py

[1.0000387 2.0000887 3.000114 4.000115 5.000068 6.0000405 7. ]
<NDArray 7 @gpu(0)>
Traceback (most recent call last):
  File "lstm_test.py", line 42, in <module>
    mx.nd.waitall()
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 166, in waitall
    check_call(_LIB.MXNDArrayWaitAll())
  File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [22:04:02] src/operator/./rnn-inl.h:1006: Check failed: in_data.size() == num_inputs (4 vs. 5) :
Stack trace:
  [bt] (0) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7ffa222a1082]
  [bt] (1) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::RNNOp<mshadow::gpu, float, float>::Backward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x1dc) [0x7ffa26c335bc]
  [bt] (2) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::RNNStatefulGradCompute<mshadow::gpu>(mxnet::OpStatePtr const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x21b6) [0x7ffa26c68186]
  [bt] (3) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x1333) [0x7ffa24680473]
  [bt] (4) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x1d) [0x7ffa2468173d]
  [bt] (5) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::BulkFlush()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&)+0x1ec) [0x7ffa24e3f3dc]
  [bt] (6) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x945) [0x7ffa24e42c35]
  [bt] (7) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x11d) [0x7ffa24e5a64d]
  [bt] (8) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7ffa24e5a8fe]

@stephenrawls
Contributor

Looking at this now.

@stephenrawls
Contributor

I think I have a solution; at least, I have tested it locally and it appears to work.

I just need to update the unit tests, and then I will file a PR.

@roywei
Member

roywei commented Jun 19, 2019

Thanks @Ishitori for the catch!
Hi @stephenrawls, could you tag @szha and me in your PR? We would like to include your fix in the MXNet 1.5.0 release, since this feature is already included in 1.5.0.rc1.
Thanks!

@leleamol
Contributor

I verified that the issue is fixed in the latest code.

@lanking520 I would recommend closing this issue.

@szha szha closed this as completed Jun 21, 2019