
Hybridization SequenceLast Bug #13968

Closed
stephenrawls opened this issue Jan 23, 2019 · 7 comments

Comments

@stephenrawls
Contributor

Description

I am experiencing a crash related to hybridization and the SequenceLast operator. It crashes consistently for me, but others on the team haven't been able to reproduce it.

Environment info (Required)

----------Python Info----------
Version      : 3.7.2
Compiler     : Clang 9.0.0 (clang-900.0.39.2)
Build        : ('default', 'Dec 27 2018 07:35:45')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 18.1
Directory    : /usr/local/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.0
Directory    : /usr/local/lib/python3.7/site-packages/mxnet
Commit Hash   : b3be92f4a48bce62a5a8424271871c2f81c8f7f1
----------System Info----------
Platform     : Darwin-16.7.0-x86_64-i386-64bit
system       : Darwin
node         : dca90487206c.ant.amazon.com
release      : 16.7.0
version      : Darwin Kernel Version 16.7.0: Sun Oct 28 22:30:19 PDT 2018; root:xnu-3789.73.27~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0341 sec, LOAD: 0.6150 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0352 sec, LOAD: 0.2739 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0428 sec, LOAD: 0.2802 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0420 sec, LOAD: 0.3961 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0305 sec, LOAD: 0.8088 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0337 sec, LOAD: 0.1836 sec.

I'm using the Python package.

Unfortunately I don't have a minimal reproducible example yet. Every time I try, the bug does not occur in my condensed example.

The problem I have is that I am seeing the following error message:

 File ".../lib/python3.4/site-packages/mxnet/base.py", line 251, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [23:04:29] .../src/src/imperative/imperative_utils.cc:58: Check failed: !ndinputs.back()->is_none() node_264 0
Stack trace returned 10 entries:
[bt] (0) .../lib/libmxnet.so(dmlc::StackTrace()+0x3d) [0x7ff375cd4e0d]
[bt] (1) .../lib/libmxnet.so(mxnet::imperative::RunGraph(bool, nnvm::IndexedGraph const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> >, unsigned long, unsigned long, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> >&&, std::vector<unsigned int, std::allocator<unsigned int> >&&, std::vector<mxnet::OpStatePtr, std::allocator<mxnet::OpStatePtr> >*, std::vector<mxnet::DispatchMode, std::allocator<mxnet::DispatchMode> > const&, bool)+0x5a4) [0x7ff375e4a924]
[bt] (2) .../lib/libmxnet.so(mxnet::Imperative::Backward(std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, bool, bool, bool)+0x22fc) [0x7ff375e3969c]
[bt] (3) .../lib/libmxnet.so(MXAutogradBackwardEx+0x2e0) [0x7ff375cfe640]

When I print out the symbolic debug string for my network and go to "node_264", it leads me to the following section of my model code:

    last_real_tag = mx.nd.SequenceLast(
        tags_t, sequence_length=mx.nd.cast(valid_lengths, dtype=tags_t.dtype),
        use_sequence_length=True, axis=0)

The things that I think are important here are:
(1) The tags_t array is an integer sequence that we generate from ground-truth data, so this SequenceLast call shouldn't be trying to flow any gradients back through it. (Note that the stack trace is in the backward pass.)
(2) The model we run is a HybridBlock, but this particular code is part of our loss function, which is not a HybridBlock. It takes the output from the model, combines it with the ground-truth values from this SequenceLast operation, and calculates our loss.
(3) Everything worked fine when our model was a regular Gluon Block. It also works fine as a HybridBlock as long as we don't hybridize() it. But when I do hybridize() the model, I consistently see this crash (although other team members have yet to run into it). A rough sketch of this setup is shown below.
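
This is not a reproduction of the crash (I haven't been able to condense one); the names Model, loss_fn, tokens, scores and the layer sizes are purely illustrative. The shape of the code is: a hybridized HybridBlock produces per-step scores, while a plain Python (non-hybrid) loss function combines them with SequenceLast over the ground-truth tags:

    import mxnet as mx
    from mxnet import autograd, gluon

    class Model(gluon.HybridBlock):
        # Illustrative model: embed integer tokens and project to per-tag scores.
        def __init__(self, vocab_size=10, hidden=8, **kwargs):
            super(Model, self).__init__(**kwargs)
            with self.name_scope():
                self.embed = gluon.nn.Embedding(vocab_size, hidden)
                self.proj = gluon.nn.Dense(vocab_size, flatten=False)

        def hybrid_forward(self, F, tokens):
            # tokens: (seq_len, batch) -> scores: (seq_len, batch, vocab_size)
            return self.proj(self.embed(tokens))

    def loss_fn(scores, tags_t, valid_lengths):
        # Plain (non-hybrid) loss: pick the ground-truth tag and the model score
        # at each sequence's last valid step, then compute a cross-entropy loss.
        last_real_tag = mx.nd.SequenceLast(
            tags_t, sequence_length=mx.nd.cast(valid_lengths, dtype=tags_t.dtype),
            use_sequence_length=True, axis=0)
        # last_real_tag = last_real_tag.detach()   # <- the workaround described below
        last_scores = mx.nd.SequenceLast(
            scores, sequence_length=valid_lengths, use_sequence_length=True, axis=0)
        return gluon.loss.SoftmaxCrossEntropyLoss()(last_scores, last_real_tag)

    model = Model()
    model.initialize()
    model.hybridize()   # the crash only shows up for us after hybridize()

    tokens = mx.nd.floor(mx.nd.random.uniform(0, 10, shape=(5, 2)))   # (seq_len, batch)
    tags_t = mx.nd.floor(mx.nd.random.uniform(0, 10, shape=(5, 2)))   # ground-truth tags
    valid_lengths = mx.nd.array([5, 3])

    with autograd.record():
        loss = loss_fn(model(tokens), tags_t, valid_lengths)
    loss.backward()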

We currently have a workaround of adding the following statement directly after the SequenceLast operation:

last_real_tag = last_real_tag.detach()

When this is in place I never encounter the crash I highlighted above.
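
In context (same hypothetical names as the snippet above), the workaround looks like this, detaching the SequenceLast output so autograd never tries to flow a gradient back through it:

    last_real_tag = mx.nd.SequenceLast(
        tags_t, sequence_length=mx.nd.cast(valid_lengths, dtype=tags_t.dtype),
        use_sequence_length=True, axis=0)
    last_real_tag = last_real_tag.detach()   # drop this output from the autograd graph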

Additionally, I have another, hackier workaround: instead of SequenceLast, I use a combination of take and diag to perform the same function. When I do that I also never see the crash. A rough sketch of that approach follows.
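
Roughly, that alternative looks like the following (again with hypothetical shapes: tags_t is (seq_len, batch) with time on axis 0, valid_lengths is (batch,), and it assumes mx.nd.diag is available in your MXNet build):

    # Gather the row at each sequence's last valid step; take() along axis 0
    # yields a (batch, batch) matrix, and diag() keeps the matching batch entry.
    last_idx = valid_lengths - 1
    rows = mx.nd.take(tags_t, last_idx, axis=0)   # rows[i, j] = tags_t[valid_lengths[i] - 1, j]
    last_real_tag = mx.nd.diag(rows)              # shape (batch,)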

I'm happy to provide further info as needed; apologies for not being able to condense this down into an actual reproducible crash.

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels:

@frankfliu
Contributor

frankfliu commented Jan 23, 2019

@mxnet-label-bot add [Gluon, operator]

@szha
Member

szha commented Jan 27, 2019

@stephenrawls thanks for reporting the bug. If it's hard to reproduce the error, you may consider setting this environment variable: MXNET_ENGINE_TYPE=NaiveEngine. This causes all computation to be synchronous, and the stack trace should reveal where the error is raised.
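
For example, you can export it in the shell before running, or set it at the very top of the script (it has to be set before mxnet is imported to take effect):

    # Set the engine type before importing mxnet so it takes effect.
    import os
    os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'

    import mxnet as mx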

@stephenrawls
Contributor Author

@szha thanks for the tip. The stack trace reveals that the error happens in the backward pass, at a line checking that a certain array is non-empty, which is corroborated by our workaround of detaching the problematic variable from the computation graph. (And since that variable shouldn't have any gradient flowing through it anyway, detaching it does not affect our calculations and is an effective workaround.)

But I'll play around with MXNET_ENGINE_TYPE and see if I can get a more reproducible error.

@piyushghai
Contributor

@stephenrawls Can you provide a minimal reproducible example so that we can help nail down this issue?

@piyushghai
Contributor

Hi @stephenrawls, were you able to get a reproducible example for this error?

@stephenrawls
Contributor Author

No, I wasn't, and our codebase has moved on, so I'm closing the issue. Thanks.
